1 vote · 1 answer · 171 views
If I do a memcpy_async on a per-thread basis, everything works fine; see the test_memcpy32 below. This code prefetches data within a single warp. I want to expand this, so that I can prefetch data in ...
Johan • 77.4k
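For context, a hedged sketch of the warp-scoped variant this question is heading toward, using cooperative_groups::memcpy_async (CUDA 11+). The kernel name and TILE size are illustrative, not taken from the question:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    constexpr int TILE = 32;   // one tile per block of 32 threads (a single warp)

    __global__ void prefetch_per_warp(const int *src, int *dst, int n) {
        __shared__ int smem[TILE];
        auto warp = cg::tiled_partition<32>(cg::this_thread_block());

        // All lanes cooperate in one asynchronous copy instead of each
        // thread issuing its own memcpy_async.
        cg::memcpy_async(warp, smem, src + blockIdx.x * TILE, sizeof(int) * TILE);
        cg::wait(warp);   // block until the staged tile has arrived

        int i = blockIdx.x * TILE + warp.thread_rank();
        if (i < n) dst[i] = smem[warp.thread_rank()];
    }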
2 votes · 1 answer · 90 views
I am trying to understand this article: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda More specifically, bank conflicts are what I am ...
user8469759 • 2,948
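For readers skimming this one: the padding idiom that chapter introduces can be sketched as below (the chapter's own macro names, shown here for 32-bank hardware; the article itself targets 16 banks):

    #define NUM_BANKS 32
    #define LOG_NUM_BANKS 5
    // One padding element per NUM_BANKS elements, so the scan's strided
    // tree accesses land in distinct banks.
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    // Inside the scan kernel every shared-memory index gets the offset:
    //   ai += CONFLICT_FREE_OFFSET(ai);
    //   bi += CONFLICT_FREE_OFFSET(bi);
    //   temp[bi] += temp[ai];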
0 votes · 1 answer · 202 views
I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
pettersson
2 votes · 0 answers · 86 views
Is there an officially sanctioned way to reuse shared data between global functions? Consider the following code https://cuda.godbolt.org/z/KMj9EKKbf: #include <cuda.h> #include <stdio.h> ...
Johan • 77.4k
-3 votes · 1 answer · 131 views
I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts: a two-way bank conflict and a three-way bank conflict. The figure below shows how the words are mapped ...
kdh • 194
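To make the two-way case concrete, a minimal sketch (assuming 32 banks of 4-byte words and a 32-thread launch; not the book's exact figure):

    __global__ void twoWayConflict(const float *in, float *out) {
        __shared__ float s[64];
        s[threadIdx.x] = in[threadIdx.x];
        s[threadIdx.x + 32] = in[threadIdx.x + 32];
        __syncthreads();
        // Stride-2 word access: lanes t and t+16 hit the same bank, so the
        // warp's request is replayed twice -- a two-way bank conflict.
        out[threadIdx.x] = s[threadIdx.x * 2];
    }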
0 votes · 1 answer · 55 views
I am developing a CUDA kernel that takes as input an array input of unsigned char of size n * 57, where n is a kernel argument. Nsight Compute reports that the memory accesses are not coalesced, ...
ric • 101
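One common remedy, sketched under assumptions (one 57-byte record per thread; names are illustrative; bounds checks for a partial last block omitted): stage the block's bytes into shared memory with coalesced loads, then read each record from fast on-chip memory.

    __global__ void process57(const unsigned char *input, int n) {
        extern __shared__ unsigned char s[];            // blockDim.x * 57 bytes
        size_t base = (size_t)blockIdx.x * blockDim.x * 57;
        // Coalesced: consecutive threads load consecutive bytes.
        for (int i = threadIdx.x; i < blockDim.x * 57; i += blockDim.x)
            s[i] = input[base + i];
        __syncthreads();
        const unsigned char *record = &s[threadIdx.x * 57];
        // ... per-record work on `record` ...
    }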
4 votes · 1 answer · 135 views
I have been trying to understand how data broadcasting works. To that end, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
log0xFF • 43
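For reference, the two conflict-free read shapes such kernel pairs usually contrast, sketched for a single 32-thread warp:

    __global__ void broadcastDemo(const float *in, float *out) {
        __shared__ float s[32];
        s[threadIdx.x] = in[threadIdx.x];
        __syncthreads();
        float a = s[0];            // all lanes read one word: a broadcast
        float b = s[threadIdx.x];  // one word per lane, one bank each: no conflict
        out[threadIdx.x] = a + b;
    }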
1 vote · 1 answer · 68 views
The problem I am having is that the matrixMulKernel_tiled kernel function is not performing any summing. I am unsure why the output matrix is consistently all zeros. Any thoughts? Please note, ...
Maayan Israel
0 votes · 1 answer · 139 views
Consider the following CUDA kernel that is used in Python via CuPy, from the CuPy docs: add_kernel = cp.RawKernel(r''' extern "C" __global__ void my_add(const float* x1, const float* x2, float*...
Uwe.Schneider
0 votes · 2 answers · 164 views
I would like to write a CUDA kernel that uses two different dynamically allocated shared memory arrays, in the following form: __global__ myKernel() { extern __shared__ int localSum1[]; extern ...
Uwe.Schneider
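Since a kernel gets exactly one dynamic shared-memory allocation, the usual answer is a single extern array carved up by hand. A minimal sketch, assuming element counts n1 and n2 supplied by the caller (align the second pointer manually if the element types differ):

    __global__ void myKernel(int n1, int n2) {
        extern __shared__ int smem[];
        int *localSum1 = smem;        // first n1 ints
        int *localSum2 = smem + n1;   // next n2 ints
        // ... use localSum1[0..n1) and localSum2[0..n2) ...
    }

    // launch with the combined size:
    //   myKernel<<<grid, block, (n1 + n2) * sizeof(int)>>>(n1, n2);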
0 votes · 1 answer · 161 views
I've queried the CUDA device (T1000, SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.) #include <cuda.h> #include <...
sof • 9,767
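For reference, the runtime-API shape of such a query, sketched for one attribute (the driver-API cuDeviceGetAttribute works analogously):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int smemPerBlock = 0;
        cudaDeviceGetAttribute(&smemPerBlock,
                               cudaDevAttrMaxSharedMemoryPerBlock, /*device=*/0);
        std::printf("max shared memory per block: %d bytes\n", smemPerBlock);
        return 0;
    }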
1 vote · 0 answers · 274 views
I am trying to incrementally optimize a matrix transpose operation in CUDA and gain some hands-on experience. I have tried a few things, but the timing measurements I am getting do not make sense. ...
Saydon • 27
3 votes · 0 answers · 385 views
I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer using CUDA. In my code, each CUDA thread is responsible for computing a single ...
Sunjnn • 51
1 vote · 1 answer · 386 views
After just upgrading to CUDA 12.4 and recompiling my project, I got the following warning: Experiments.cu:188:39: warning: ‘cudaError_t cudaFuncSetSharedMemConfig(T*, cudaSharedMemConfig) [with T = ...
Serge Rogatch
0 votes · 1 answer · 376 views
I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is, of course, fast. However, in image filtering, ...
MeiH • 1,875
4 votes · 1 answer · 180 views
I have a fairly large object that I want to load into shared memory so that multiple warps can access the object, something like struct alignas(8) Foo{ int a; float b; vec2 c; uvec2 d; ...
Krupip • 5,356
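A common pattern for this, sketched (the struct is abridged, with built-in float2/uint2 standing in for vec2/uvec2): the whole block copies the object word by word, then a barrier makes it safe for every warp.

    struct alignas(8) Foo { int a; float b; float2 c; uint2 d; };

    __global__ void useFoo(const Foo *g) {
        __shared__ Foo sfoo;
        auto *dst = reinterpret_cast<int *>(&sfoo);
        auto *src = reinterpret_cast<const int *>(g);
        for (unsigned i = threadIdx.x; i < sizeof(Foo) / sizeof(int); i += blockDim.x)
            dst[i] = src[i];
        __syncthreads();   // every warp may now read sfoo
        // ... use sfoo ...
    }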
0 votes · 1 answer · 132 views
We are using dynamic shared memory in our CUDA kernels. We are setting the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
msedi • 1,815
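For reference, the attribute involved, sketched in both APIs (fn, kernel, and bytes are placeholders; opting in this way is required before launching with more than 48 KiB of dynamic shared memory):

    // Driver API:
    cuFuncSetAttribute(fn, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, bytes);
    // Runtime API equivalent:
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);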
2 votes · 0 answers · 186 views
I know that in CUDA programming, memory reads at different levels can overlap. For example, data transfers from global memory to shared memory and from shared memory to registers can overlap. But can read ...
Y. Chen • 51
2 votes · 0 answers · 173 views
I expect a CUDA shared memory bank conflict in the following two situations: accessing successive half floats (2 words) with successive threads, and accessing vectorized int4 datatypes by successive ...
fabian • 1,881
0 votes · 1 answer · 135 views
I sometimes see the following shared memory declaration in CUDA kernels, and I am not sure what it means: extern __shared__ T shmem[][SZ] with SZ being a compile-time constant. The kernel is launched ...
fabian • 1,881
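A sketch of how that declaration behaves (T and the kernel name are illustrative): only the innermost extent SZ must be a compile-time constant; the outer extent is implied by the dynamic size passed at launch.

    #define SZ 32

    __global__ void k(int rows) {
        extern __shared__ float shmem[][SZ];   // 2D view over the dynamic allocation
        if ((int)threadIdx.y < rows)
            shmem[threadIdx.y][threadIdx.x] = 0.0f;   // ordinary 2D indexing
    }

    // launch with rows * SZ * sizeof(float) dynamic bytes:
    //   k<<<grid, dim3(SZ, rows), rows * SZ * sizeof(float)>>>(rows);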
0 votes · 0 answers · 77 views
The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N), using a shared memory region with dimensions tiledim ...
Athanasios Margaris
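For comparison, a minimal tiled multiply in the shape the question describes (square tiledim x tiledim staging tiles; N assumed to be a multiple of tiledim, bounds checks omitted):

    template <int tiledim>
    __global__ void matmul_tiled(const float *matA, const float *matB,
                                 float *matC, int N) {
        __shared__ float tA[tiledim][tiledim];
        __shared__ float tB[tiledim][tiledim];
        int row = blockIdx.y * tiledim + threadIdx.y;
        int col = blockIdx.x * tiledim + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N / tiledim; ++t) {
            tA[threadIdx.y][threadIdx.x] = matA[row * N + t * tiledim + threadIdx.x];
            tB[threadIdx.y][threadIdx.x] = matB[(t * tiledim + threadIdx.y) * N + col];
            __syncthreads();                 // tile fully staged
            for (int k = 0; k < tiledim; ++k)
                acc += tA[threadIdx.y][k] * tB[k][threadIdx.x];
            __syncthreads();                 // done reading before the next load
        }
        matC[row * N + col] = acc;
    }
    // launch: matmul_tiled<32><<<dim3(N/32, N/32), dim3(32, 32)>>>(A, B, C, N);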
0 votes · 1 answer · 367 views
I assume that, in the context of data fetches for a CUDA core, registers are the fastest, then shared memory, then the L1 cache, then the L2 cache, and global memory is the slowest. I assume that in a GPU data moves in ...
user2166888
1 vote · 0 answers · 154 views
I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3 is the baseline shared memory GEMM; gemm4 uses fewer thread blocks in the x dimension; gemm5 uses fewer blocks in both ...
Worldbuffer
1 vote · 0 answers · 106 views
Following the trick described here, I tested the following code and got the corresponding profiling result. Conflicts were notably diminished, but some still persist. // store conflict __global__ void ...
picklesmithy129
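For contrast, the padded shape that usually finishes the job, sketched on a 32x32 transpose tile (one extra column shifts each row by one bank, so the column-wise reads stop conflicting; launch with 32x32 blocks):

    __global__ void transpose32(const float *in, float *out, int n) {
        __shared__ float tile[32][33];         // 33 = 32 + 1 padding column
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;     // swapped block coordinates
        y = blockIdx.x * 32 + threadIdx.y;
        if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }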
0 votes · 1 answer · 73 views
I am trying to transpose a matrix. It works as expected for some values and starts crashing with bigger ones, or even between executions of the program. What I am trying to do is split the matrix ...
BrightSoul
2 votes · 0 answers · 107 views
I am writing a particle-based code in which every particle's neighbor list is already generated with the initial condition and unchanged during the whole simulation, but the positions of each ...
Sangjun Lee
8 votes · 1 answer · 7k views
From the CUDA Programming Guide: [Warp shuffle functions] exchange a variable between threads within a warp. I understand that this is an alternative to shared memory, thus it's being used for ...
gonidelis • 1,125
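The canonical example of that use, sketched: a warp-wide sum that never touches shared memory.

    __device__ float warpSum(float v) {
        // After log2(32) = 5 shuffle steps, lane 0 holds the sum of all lanes.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }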
1 vote · 1 answer · 1k views
I have written a simple CUDA program to perform array reduction using thread block clusters and distributed shared memory. I am compiling it with CUDA 12.0 and running on a Hopper GPU. Below is the ...
Ricky Dev
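For orientation, a hedged minimal shape of such a program (Hopper, CUDA 12+; a two-block cluster in which block 0 reads block 1's shared memory via map_shared_rank -- the toy values are assumptions, not the question's code):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void __cluster_dims__(2, 1, 1) clusterReduce(int *out) {
        __shared__ int partial;
        auto cluster = cg::this_cluster();
        if (threadIdx.x == 0) partial = blockIdx.x + 1;   // toy per-block result
        cluster.sync();                  // peers' shared memory now valid
        if (cluster.block_rank() == 0 && threadIdx.x == 0) {
            int *peer = cluster.map_shared_rank(&partial, 1);
            out[0] = partial + *peer;    // read the remote block's shared memory
        }
        cluster.sync();                  // keep it alive until everyone is done
    }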
0 votes · 1 answer · 135 views
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to benchmark a warp (32 threads) accessing 32 integers of 32 bits each in the following 2 ...
Ferdinand Mom
-2 votes · 1 answer · 92 views
I had a bug in my CUDA kernel code. I managed to locate and fix the bug, but it would be nice to know exactly what caused it. I am saving two numbers (p_x, p_y) [3D array XY sizes] in a ...
adnan • 1
7 votes · 3 answers · 3k views
I am currently studying CUDA and have learned that there are global memory and shared memory. I have checked the CUDA documentation and found that GPUs can access shared memory and global memory using ld.shared/...
Tae • 125
1 vote · 1 answer · 3k views
I'm currently learning how to write fast CUDA kernels. I implemented a tiled matrix multiplication (block size 32x32) which only does coalesced reads/writes from/to global memory and has no bank ...
chuber • 11
2 votes · 0 answers · 644 views
I faced a problem when training stylegan3 where the terminal gets stuck at "Setting up PyTorch plugin "upfirdn2d_plugin"...". I have tried all the methods I found, such as reinstalling ...
Fu Wenjin
2 votes · 1 answer · 465 views
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA. On the one hand, it seems that any load/store is compiled to ...
Pierre T. • 388
0 votes · 1 answer · 2k views
In the CUDA programming guide, in the shared memory section, it states that shared memory access by the warp is not serialized but broadcast for reads. However, it doesn't state what happens if the ...
Niteya Shah • 1,824
2 votes · 0 answers · 1k views
I'm looking for a way to benchmark shared memory and the L1/L2 caches. However, the benchmark results I found are very different depending on the source. In the paper Dissecting the NVIDIA Volta GPU ...
Huy Le • 1,999
0 votes · 1 answer · 2k views
My kernel allocates shared memory for data storage, but a bug is reported if I change the size of the shared memory; see the code attached. #include <stdio.h> #include <assert.h> #define ...
Mangoccc
0 votes · 1 answer · 717 views
If all threads in the same block access the same address, i.e. array[0], there is a bank conflict for some old compute capabilities. But does this conflict still exist for the latest compute capabilities (i.e....
cctv • 19
1 vote · 2 answers · 2k views
I encountered a performance issue where shared memory atomicAdd on float is much more expensive than on int, after profiling with nv-nsight-cu-cli. After checking the generated SASS, I found ...
JGL • 158
0 votes · 0 answers · 398 views
docker run --gpus all --shm-size=4gb --name bookfinalfinal --volume D:/BankDataDocker/data:/usr/src/app/data --volume D:/BankDataDocker/run:/usr/src/app/runs yolo:v1 bash
docker: Error response ...
JuniorScholar
1 vote · 1 answer · 976 views
I have a kernel where each warp accumulates the sum of a chunk of data. At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
Elad Maimoni • 4,791
1 vote · 1 answer · 4k views
Traditionally, NVIDIA GPUs have offered CUDA thread blocks shared memory in amounts always divisible by 16 KiB (see e.g. this table). However, with Ampere 8.0 and 8.6 GPUs, the amounts are 99 KiB ...
einpoklum • 138k
0 votes · 1 answer · 146 views
My questions arise from reading the last example in B.5 Memory Fence Functions. I understand that the flag in this example checks for the final block processing a sum. As I imagine it, if the flag is ...
JGL • 158
1 vote · 1 answer · 960 views
I was reading a slide deck from NVIDIA (Optimizing Parallel Reduction in CUDA) discussing a parallel reduction algorithm. On slide 15, the author explains that by using sequential addressing, we ...
Reza Namvar
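That slide's kernel, sketched for reference (sequential addressing: the active threads stay a contiguous prefix while the stride halves, so the shared-memory accesses are conflict-free):

    __global__ void reduceSequential(const float *g_in, float *g_out) {
        extern __shared__ float sdata[];
        unsigned tid = threadIdx.x;
        sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) g_out[blockIdx.x] = sdata[0];
    }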
0 votes · 1 answer · 224 views
I have a hash table data structure that uses linear probing and is designed to be lock-free with CAS. The hash table: constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff; struct OnceLock {...
JGL • 158
2 votes · 1 answer · 465 views
I'm reading the book "Programming Massively Parallel Processors" (3rd edition), which presents an implementation of the Kogge-Stone parallel scan algorithm. This algorithm is meant to be run by ...
Damiano Massarelli
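The per-block Kogge-Stone scan, sketched from its usual presentation (SECTION_SIZE equal to the block size assumed; the extra barrier-plus-register step avoids the read/write race on XY):

    #define SECTION_SIZE 256

    __global__ void koggeStoneScan(const float *in, float *out, int n) {
        __shared__ float XY[SECTION_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        XY[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            float v = 0.0f;
            if (threadIdx.x >= stride) v = XY[threadIdx.x - stride];
            __syncthreads();
            if (threadIdx.x >= stride) XY[threadIdx.x] += v;
        }
        if (i < n) out[i] = XY[threadIdx.x];
    }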
0 votes · 0 answers · 294 views
Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes). As I have come to realize, I won't get an error trying to set the ...
einpoklum • 138k
0 votes · 1 answer · 476 views
Consider the program below. It gets a CUDA device's current shared memory bank size configuration; sets it to another value; then gets it again. Unfortunately, this is the output: The reported shared ...
einpoklum • 138k
3 votes · 1 answer · 721 views
I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; // declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; ...
Yichen • 101
2 votes · 1 answer · 1k views
I would like to be sure that I correctly understand bank conflicts in shared memory. I have 32 segments of data. These segments consist of 128 integers each. [[0, 1, ..., 126, 127], [128, 129, ..., ...
Piotr K.
