Best practices
1 vote
0 replies
28 views

Looking for advice on a workflow problem rather than a specific library issue. The situation: I have a notebook that installs packages via pip in Cell 1: # REVISED CELL 1 import subprocess, sys # ...
Sanjid Hasan
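The truncated Cell 1 above only hints at the approach; a common way to make an in-notebook pip install idempotent is to check importability before shelling out to pip. A minimal Python sketch (the `ensure_installed` helper name is my own, not from the question):

```python
import importlib
import importlib.util
import subprocess
import sys

def ensure_installed(module_name, pip_name=None):
    """Install pip_name (defaults to module_name) via pip only if the module
    is not already importable, then return the imported module."""
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install",
                               pip_name or module_name])
    return importlib.import_module(module_name)

# json ships with Python, so this returns immediately without invoking pip.
json_mod = ensure_installed("json")
```

Re-running the cell is then cheap: once the module is importable, pip is never invoked again.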
0 votes
0 answers
63 views

I'm trying to take an OpenGL storage buffer, load data into it from the CPU (persistently mapping it), process that data with CUDA, and then use that same data again in further OpenGL ...
Krupip
  • 5,608
Best practices
0 votes
0 replies
34 views

A channelizer is, conceptually, a bunch of reduction operations (polyphase filter) followed by an equal number of transforms (FFT). Though I have this "mostly working", there are still some ...
Jim Clay
  • 141
-2 votes
1 answer
67 views

Let's say there are 10 threads writing and 10 threads reading from the same 32-bit integer stored in global memory, in device code, all at the same time. Are there any guarantees provided about the ...
Box Box Box Box
2 votes
1 answer
147 views

I'm trying to create a pointer like type wrapper for CUDA device pointers, but ran into a problem I can't reproduce with other compilers in regular MSVC or GCC. Basically, depending on the order I ...
Krupip
  • 5,608
1 vote
1 answer
103 views

I'm attempting to create a function that handles bit casting a float/double/float16 to an unsigned key of equivalent size. I can do this in host code easily with template<std::floating_point T> ...
Krupip
  • 5,608
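The question above asks about doing this in device code; the underlying bit trick is standard for radix-sorting floats and can be sketched on the host in Python (`float_to_ordered_key` is a hypothetical name): flip all bits of negative floats, and set the sign bit of non-negative ones, so that unsigned key order matches float order.

```python
import struct

def float_to_ordered_key(f):
    """Bit-cast a 32-bit float to a uint32 key whose unsigned order
    matches the float's numeric order."""
    (bits,) = struct.unpack("<I", struct.pack("<f", f))
    if bits & 0x80000000:          # negative: flip all bits
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000       # non-negative: set the sign bit

# Key order tracks float order, including across the sign boundary.
vals = [-2.5, -0.0, 0.0, 1.0, 3.5]
keys = [float_to_ordered_key(v) for v in vals]
assert keys == sorted(keys)
```

The same two branches map directly onto device code via `__float_as_uint`; doubles and `__half` use the identical trick with 64- and 16-bit masks.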
2 votes
2 answers
139 views

#include <stdio.h> #include <thrust/complex.h> #include <thrust/device_vector.h> // #include <thrust/host_vector.h> int main(int argc, char *argv[]) { thrust::device_vector<...
Jim Clay
  • 141
0 votes
0 answers
51 views

I'm using cub::BlockHistogram<std::uint16_t, 256, items_per_thread, 256, cub::BLOCK_HISTO_ATOMIC>; and I was expecting an API with similar utility to cub::Warp* functions, however there doesn't ...
Krupip
  • 5,608
Advice
0 votes
1 reply
45 views

Background I have an image that I'm splitting into tiles of some size, determined by parameters tile_x, and tile_y, so tile_x = 2 and tile_y = 2 would mean the image is split into a 2x2 grid of tiles (...
Krupip
  • 5,608
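For the tiling scheme described above, one common convention splits the image so edge tiles absorb any remainder; a small Python sketch under that assumption (the question excerpt does not specify how remainders are handled):

```python
def tile_bounds(width, height, tile_x, tile_y):
    """Yield (x0, x1, y0, y1) pixel bounds for each tile in a
    tile_x-by-tile_y grid. Integer division spreads any remainder
    across the tiles, so every pixel is covered exactly once."""
    for ty in range(tile_y):
        for tx in range(tile_x):
            x0 = tx * width // tile_x
            x1 = (tx + 1) * width // tile_x
            y0 = ty * height // tile_y
            y1 = (ty + 1) * height // tile_y
            yield (x0, x1, y0, y1)

# tile_x = 2, tile_y = 2 -> a 2x2 grid of four tiles.
tiles = list(tile_bounds(100, 60, 2, 2))
assert len(tiles) == 4
assert tiles[0] == (0, 50, 0, 30)
```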
1 vote
2 answers
136 views

I am trying to load Qwen on Google Colab. I concurrently asked this question on the bitsandbytes-foundation GitHub repo -> https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1905#issuecomment-...
tanvir
  • 53
-2 votes
0 answers
135 views

I'm trying to run a model on the GPU: clf2 = PunctCapSegModelONNX.from_pretrained( "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", ort_provider=["CUDAExecutionProvider&...
xæliudzyh
0 votes
1 answer
104 views

The CUDA function providing a quiet NaN is: __device__ double nan(const char *tagp) And the documentation says that Argument tagp selects one of the possible representations. My question: What does ...
einpoklum
  • 139k
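For context on what a "representation" of a quiet NaN is: IEEE-754 doubles reserve 51 payload bits below the quiet bit, and implementations of nan(tagp) typically parse the tag string into that payload (the exact parsing is implementation-defined). A Python sketch of the bit layout, just to show where such a payload lives:

```python
import math
import struct

def nan_with_payload(payload):
    """Build a quiet double NaN carrying `payload` in its low mantissa
    bits -- the region that nan(tagp) implementations typically fill
    from the parsed tag string."""
    bits = 0x7FF8000000000000 | (payload & 0x0007FFFFFFFFFFFF)
    (val,) = struct.unpack("<d", struct.pack("<Q", bits))
    return val

x = nan_with_payload(42)
assert math.isnan(x)
# The payload survives a round-trip through the double representation.
(bits,) = struct.unpack("<Q", struct.pack("<d", x))
assert bits & 0x0007FFFFFFFFFFFF == 42
```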
2 votes
0 answers
116 views

I'm testing __float128 on CUDA with a simple program: // nvcc -arch=sm_100 test_fp128_add.cu -o test_fp128_add #include <cstdio> #include <cuda_runtime.h> __global__ void add_fp128(...
Rahn
  • 5,605
4 votes
1 answer
125 views

In the Ampere architecture, consider the following scenarios: A single warp executes two load instructions: one from Shared Memory and one from DRAM. Two warps within the same SM, each executing a ...
Gal Avineri
Advice
0 votes
0 replies
92 views

I'm trying to deploy a PyTorch model to an edge device (Jetson Orin Nano). Deploying the PyTorch model directly does not satisfy the requirements posed by the clients, so I'm looking into optimizing the ...
Trung
  • 1
Advice
0 votes
0 replies
84 views

I was trying to benchmark torch.nn.LayerNorm on an A100 to compare it against a custom CUDA kernel I made. I expected the PyTorch kernel to get much closer to the A100's peak memory bandwidth (which ...
eltonjohnfanboy
Advice
1 vote
0 replies
64 views

I'm new to deep learning and I would like to install CUDA using Conda, but I'm confused about the correct order of the steps and which commands I should use. I’ve seen very different commands in ...
Balázs Tobak
Advice
0 votes
2 replies
74 views

If I multiply two 8192 x 8192 matrices of floats with cuBLAS, ncu --metrics dram__bytes_read.sum tells me it reads 4.42 GB of data in total (on a 3070). One matrix is 0.268 GB, so we read each matrix ...
asdfldsfdfjjfddjf
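A quick back-of-the-envelope check of the numbers in the question above: 8192 x 8192 floats is indeed 0.268 GB per matrix, and 4.42 GB of DRAM reads works out to each input being streamed roughly 8 times, consistent with a tiled GEMM re-reading tiles of A and B once per output block that does not fit in L2.

```python
# Arithmetic check of the figures reported by ncu.
n = 8192
bytes_per_float = 4
matrix_bytes = n * n * bytes_per_float
assert matrix_bytes == 268_435_456           # ~0.268 GB, as stated

total_read = 4.42e9                          # dram__bytes_read.sum from ncu
reads_per_matrix = total_read / (2 * matrix_bytes)
# Each input matrix is read from DRAM roughly 8 times.
assert 8 <= reads_per_matrix < 9
```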
Best practices
0 votes
1 replies
115 views

I'm fine-tuning a transformer with batch size 8 and getting CUDA out-of-memory errors. Would gradient checkpointing or mixed precision help?
Colin Leede
0 votes
0 answers
90 views

Below is a small C++ program which queries the number of devices, gets the first device's properties, then tries to cudaSetDevice() to it. I build this program on one machine, then copy it to a ...
einpoklum
  • 139k
0 votes
0 answers
44 views

For educational purposes, I am assembling my own CUDA fatbins by hand and executing them using the CUDA Driver API. The goal is to launch the kernel and display the result of the kernel on the CPU, ...
Ryunosuke Yanashita
4 votes
0 answers
72 views

Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it. Setup Single node, 2 ...
Abhishek Ghosh
2 votes
1 answer
155 views

I would like to create a struct that can work on the host and device that stores an array that has been dynamically allocated (unknown size at compile time). This struct would be sent to a kernel so ...
Programing Bear
0 votes
0 answers
66 views

I want to get the kernel template that cupy uses for its subtract function to compare with my (really simple) kernel: subtract_kernel = cp.RawKernel( """ extern "C" __global__ ...
mle
  • 101
0 votes
0 answers
122 views

I have a kernel which first reads values from certain memory locations, then writes to those memory locations. I also have a lock which ensures that at any point in time, only one thread is writing to ...
Yoogi Kovendhan
2 votes
0 answers
71 views

I've been implementing radix sort for SPH simulation use for some time now, and initially I decided to use a 4-bit radix and make it a 16-way radix sort, which I haven't seen documented too ...
harism
  • 6,113
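The 4-bit/16-way scheme mentioned above is the classic LSD radix sort with 16 buckets and 8 passes over 32-bit keys; a minimal host-side Python sketch of the idea (a real GPU version would replace the per-bucket lists with a histogram plus prefix-sum scatter):

```python
def radix_sort_u32(keys, bits_per_pass=4):
    """LSD radix sort of unsigned 32-bit keys, bits_per_pass bits at a
    time: 4 bits -> 16 buckets -> 8 passes, mirroring a 16-way GPU sort."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, 32, bits_per_pass):
        buckets = [[] for _ in range(radix)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        # Stable concatenation preserves order within equal digits,
        # which is what makes multi-pass LSD radix sort correct.
        keys = [k for b in buckets for k in b]
    return keys

data = [0xDEADBEEF, 3, 0, 42, 0xFFFFFFFF, 7]
assert radix_sort_u32(data) == sorted(data)
```

More bits per pass means fewer passes but larger histograms; 4 bits (16 ways) is a common compromise, and 8-bit/256-way variants trade shared-memory footprint for half the passes.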
3 votes
1 answer
609 views

I am trying to install llama-cpp-python with GPU support. I installed Nvidia CUDA Toolkit v13.1, nvidia-smi shows that my graphics card - Geforce GTX 1050 Ti - supports CUDA v13, nvcc is installed ...
Стебан
2 votes
1 answer
154 views

Suppose I get a pointer, and I want to determine whether it's "CUDA-associated", i.e. allocated by CUDA as pinned host-side memory, device-side memory, managed memory, array memory, etc. - ...
einpoklum
  • 139k
1 vote
1 answer
360 views

I am trying to set up a docker container using the nvidia container toolkit on a remote server, so that I can run cuda programs developed with the Futhark Programming Language - however, the issue ...
Artemijo5
2 votes
1 answer
231 views

I'm trying to configure a project of mine involving CUDA, like so: cmake \ -DCMAKE_CUDA_ARCHITECTURES=61 \ -DCMAKE_BUILD_TYPE=Release \ -DCUDAToolkit_ROOT=/usr/local/cuda-11.6 \ -DCMAKE_CUDA_COMPILER=/...
einpoklum
  • 139k
Best practices
0 votes
2 replies
71 views

I want to implement a container similar to std::pmr::vector, but backed by CUDA device memory, with usage semantics close to std::pmr::vector / thrust::device_vector. Requirements: PMR-style ...
Savrona
  • 408
0 votes
2 answers
75 views

I have a problem with PyCUDA. I use it in a Python script I develop. I know the script works because I use it on another server, but on one specific server I get a problem: >>> import pycuda....
Julien
  • 1
4 votes
1 answer
205 views

This is my first time working with CUDA programs. So I just wrote a simple hello world program. #include <stdio.h> __global__ void hello(){ printf("Hello block: %u and thread: %u\n"...
Alvin Alex
-5 votes
1 answer
86 views

I tested the performance of LAMMPS with DeepMD-kit for MD simulations on an HPC cluster. The job was allocated 8 CPUs, 64 GB of RAM, and one A100 GPU. I observed that when running with mpirun -np 1 ...
link89
  • 2,035
Advice
1 vote
6 replies
197 views

I was training AI models on my PC using TensorFlow on the CPU because I had an AMD GPU. Then I bought an NVIDIA RTX 5060 Ti 16GB, but I couldn't make it work. First, I made a new environment with Anaconda, installed ...
José Timaure
3 votes
0 answers
263 views

I am benchmarking several very simple CUDA SGEMM kernels on an NVIDIA Hopper GPU (H800, sm_90), and I observed something that I do not fully understand. I have two kernels that, to my understanding, ...
plznobug
  • 123
1 vote
0 answers
143 views

I'm new to CUDA, and I can't see where the mistake is in my kernel. Upon comparing it with the result from my professor, the difference was extremely small, with the pixel average being off by 0.0039. ...
peng33
  • 11
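Worth noting about the 0.0039 figure above: one 8-bit gray level normalized to [0, 1] is 1/255 ≈ 0.00392, so an error of exactly that size usually points at a rounding-versus-truncation difference when converting between byte and float pixels rather than at a logic error. A quick Python check:

```python
# One 8-bit gray level, normalized to [0, 1], matches the reported error.
step = 1 / 255
assert round(step, 4) == 0.0039
```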
1 vote
0 answers
131 views

I have a very simple class using the Thrust device allocator. I also have a host-side class that uses std::allocator; it works fine. But this one gives a segmentation fault. I am not sure what is wrong here. How ...
Savrona
  • 408
3 votes
1 answer
141 views

I am learning CuTe's thread-value layout, and I followed leimao's blog: https://leimao.github.io/blog/CuTe-Thread-Value-Layout/ I want to figure out which thread reads which part of the data in the matrix. So ...
Damons
  • 363
-6 votes
1 answer
160 views

I have CUDA installed via the regular Windows downloadable installer via the official website, and am trying to use PyTorch in the PyCharm program using CUDA as kernel. PyTorch now works fine, however ...
alexanderjansma
Advice
1 vote
5 replies
132 views

So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...
bob.sacamento
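The get_average() structure described above maps naturally onto a two-stage reduction: each block produces a partial sum, and the host combines them. A Python sketch of that shape, with `block_size` standing in for the slice of the array a CUDA block would reduce:

```python
def get_average(values, block_size=4):
    """Two-stage mean, mirroring the usual CUDA pattern: each 'block'
    reduces its slice to a partial sum, then the host adds the partial
    sums and divides by the total element count."""
    partial_sums = [sum(values[i:i + block_size])
                    for i in range(0, len(values), block_size)]
    return sum(partial_sums) / len(values)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
assert get_average(data) == 4.5
```

In the CUDA version, the per-block reduction typically happens in shared memory, and the partial sums are either copied back to the host or reduced in a second kernel launch.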
5 votes
2 answers
482 views

Problem I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...
NeKon
  • 312
3 votes
1 answer
112 views

Can I modify the host data in host_data_ptr after the following? cudaMemcpyAsync(device_data_ptr, host_data_ptr, size, cudaMemcpyHostToDevice, ...
YSF
  • 41
3 votes
1 answer
2k views

I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
PowerPoint Trenton
3 votes
1 answer
128 views

I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function, as shown below: #include <iostream> #include <thrust/device_vector.h>...
AowynB
  • 33
-2 votes
1 answer
132 views

I have coded a simple CUDA ZIP password cracker, but it prints the same password a number of times and I can't figure out why; this is weighing down my program. Here is the full ...
actgroup inc
0 votes
1 answer
380 views

I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with: in function `main....
einpoklum
  • 139k
4 votes
2 answers
237 views

I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output. Below are the steps I tried: Changed run type to T4 GPU. !pip install nvcc4jupyter %load_ext ...
Digvijay Singh Thakur
1 vote
1 answer
128 views

I want to create a skeleton for a project in which there are multiple cuda and cpp files. They will be compiled individually and then linked together to form a single executable. Currently I have the ...
ThErOmAnEmPiRe
1 vote
1 answer
91 views

I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022. I want to debug code, but I can't debug with step over (F10). The debugger always stops at a location without a breakpoint....
Imagination Youth
