360 questions
1
vote
1
answer
171
views
memcpy_async does not work with pipeline roles
If I do a memcpy_async on a per-thread basis, everything works fine; see test_memcpy32 below.
This code prefetches data within a single warp.
I want to expand this, so that I can prefetch data in ...
2
votes
1
answer
90
views
Degree of Bank conflicts in cuda - Picture not clear from GPU GEMS Prefix Sum article
I am trying to understand this article : https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
More specifically, bank conflicts are what I am ...
0
votes
1
answer
202
views
How to properly use VK_KHR_external_memory for sharing memory between two processes using Vulkan API
I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
2
votes
0
answers
86
views
Reusing __shared__ data between __global__ functions
Is there an officially sanctioned way to reuse __shared__ data between __global__ functions?
Consider the following code https://cuda.godbolt.org/z/KMj9EKKbf:
#include <cuda.h>
#include <stdio.h>
...
-3
votes
1
answer
131
views
N-way bank conflict on GPU shared memory in 64-bit mode and access order across words
I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts:
Two-way bank conflict
Three-way bank conflict
Figure below is how the words are mapped ...
0
votes
1
answer
55
views
Is it possible to read an array of unsigned char as a long and then save it into another array of unsigned char?
I am developing a CUDA kernel that takes as input an array input of unsigned char of size n * 57, where n is a kernel argument. NSight Compute reports that the memory accesses are not coalesced, ...
4
votes
1
answer
135
views
data broadcasting from shared memory bank
I have been trying to understand how data broadcasting works. To that end, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
1
vote
1
answer
68
views
CUDA Shared Memory Dynamic Memory Allocation [closed]
The problem I am having is that the matrixMulKernel_tiled kernel function is not performing any summing. I am unsure as to why the output matrix is consistently all zeros. Any thoughts? Please note, ...
0
votes
1
answer
139
views
Raw kernel with dynamically allocated shared memory
Consider the following CUDA kernel, taken from the CuPy docs, that is used in Python via CuPy:
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float*...
0
votes
2
answers
164
views
CUDA shared memory with different dtypes
I would like to write a CUDA kernel that uses two different (dynamically) shared memory arrays in the following form
__global__ myKernel()
{
extern __shared__ int localSum1[];
extern ...
0
votes
1
answer
161
views
Estimated transactions on coalesced memory accesses
I've queried the CUDA device (T1000 SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.)
#include <cuda.h>
#include &...
1
vote
0
answers
274
views
CUDA matrix transpose with shared mem
I am trying to incrementally optimize a matrix transpose operation on CUDA and gain some hands-on experience. I have tried a few things, but the timing measurements that I am getting do not make sense. ...
3
votes
0
answers
385
views
Bank Conflict Issue in CUDA Shared Memory Access
I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer using CUDA. In my code, each CUDA thread is responsible for computing a single ...
1
vote
1
answer
386
views
cudaFuncSetSharedMemConfig is deprecated in 12.4 - why?
After just upgrading to CUDA 12.4 and recompiling my project, I got the following warning:
Experiments.cu:188:39: warning: ‘cudaError_t cudaFuncSetSharedMemConfig(T*, cudaSharedMemConfig) [with T = ...
0
votes
1
answer
376
views
Correct way of using cuda __shared__ memory for image filtering
I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is of course fast. However, in image filtering, ...
4
votes
1
answer
180
views
Reinterpret cast on *shared memory*
I have a fairly large object that I want to load into shared memory so that multiple warps can access the object, something like
struct alignas(8) Foo{
int a;
float b;
vec2 c;
uvec2 d;
...
0
votes
1
answer
132
views
What is the difference of dynamic shared memory as kernel attribute and kernel argument in CUDA
We are using dynamic shared memory in our CUDA kernels. We are setting the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
2
votes
0
answers
186
views
Can memory read and write operations overlap in CUDA programming?
I know in CUDA programming, memory reads at different levels can overlap. For example, data transfers from global memory to shared memory and from shared memory to registers can overlap. But can read ...
2
votes
0
answers
173
views
Why is there no Shared Memory Bank conflict when loading consecutive half floats or vectorized int4?
I expect a cuda shared memory bank conflict in the following two situations:
Accessing successive half floats (2 words) with successive threads
Accessing vectorized int4 datatypes by successive ...
0
votes
1
answer
135
views
Use of Mixture of Static and Dynamic Shared Memory in Nested Arrays for Cuda Kernels
I sometimes see the following shared memory declaration in CUDA kernels, and I am not sure what it means:
extern __shared__ T shmem[][SZ]
with SZ being a compile-time constant. The kernel is launched ...
0
votes
0
answers
77
views
Why this code that uses dynamically allocated shared memory in CUDA does not work? [duplicate]
The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N) using a shared memory region with dimensions tiledim ...
0
votes
1
answer
367
views
In V100 GPU or A100 GPU, CUDA COREs- data movement path - where do they look first for data in Shared Memory or L1 cache
I assume that, in the context of data fetches for a CUDA core, registers are the fastest, next shared memory, then L1 cache, then L2 cache, and global memory is the slowest.
I assume in a GPU data moves in ...
1
vote
0
answers
154
views
Understanding the Reduction in Bank Conflicts in CUDA Kernels
I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication:
gemm3: baseline of shared memory GEMM
gemm4: less thread blocks in x dimension
gemm5: less blocks in both ...
1
vote
0
answers
106
views
Still bank conflict after shared memory padding
Following the trick described here, I tested the following code and got the corresponding profiling result. Conflicts were notably diminished, but some still persist.
// store conflict
__global__ void ...
0
votes
1
answer
73
views
CUDA transpose kernel fails randomly
I am trying to transpose a matrix. It works as expected for some values and starts crashing with bigger ones, or even between executions of the program.
What I am trying to do is split the matrix ...
2
votes
0
answers
107
views
Efficient access to global memory to pre-calculated locations
I am writing a particle-based code in which every particle's neighbor list is generated with the initial condition and unchanged during the whole simulation, but the positions of each ...
8
votes
1
answer
7k
views
What is warp shuffling in CUDA and why is it useful?
From the CUDA Programming Guide:
[Warp shuffle functions] exchange a variable between threads within a warp.
I understand that this is an alternative to shared memory, thus it's being used for ...
1
vote
1
answer
1k
views
Thread block clusters and distributed shared memory not working as intended
I have written a simple CUDA program to perform array reduction using thread block clusters and distributed shared memory. I am compiling it with CUDA 12.0 and running on a Hopper GPU. Below is the ...
0
votes
1
answer
135
views
CUDA shared memory bank conflict unexpected timing
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) accesses 32 integers of 32 bits each in the following 2 ...
-2
votes
1
answer
92
views
Why __shared__ memory causes error in calculations
I had a bug in my CUDA kernel code. I managed to find the bug and fix it, but it would be nice to know what caused it exactly.
I am saving two numbers (p_x, p_y) [3D array XY sizes] in a ...
7
votes
3
answers
3k
views
In CUDA, what instruction is used to load data from global memory to shared memory?
I am currently studying CUDA and learned that there are global memory and shared memory.
I have checked the CUDA document and found that GPUs can access shared memory and global memory using ld.shared/...
1
vote
1
answer
3k
views
CUDA pipeline asynchronous memory copy from global to shared memory
I'm currently learning how to write fast CUDA kernels. I implemented a tiled matrix multiplication (block size 32x32) which only does coalesced reads/writes from/to global memory and has no bank ...
2
votes
0
answers
644
views
Problem with training stylegan3 - got stuck on Setting up PyTorch plugin "upfirdn2d_plugin"
I faced a problem when training stylegan3 where the terminal got stuck at "Setting up PyTorch plugin "upfirdn2d_plugin"... ". I have tried all the methods I found, such as reinstalling ...
2
votes
1
answer
465
views
Are load and store operations in shared memory atomic?
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to ...
0
votes
1
answer
2k
views
Does CUDA broadcast shared memory to all threads in a block without a bank conflict?
In the CUDA programming guide, in the shared memory section, it states that shared memory access by the warp is not serialized but broadcasted for reads.
However it doesn't state what happens if the ...
2
votes
0
answers
1k
views
CUDA C++: How to write a program to benchmark shared memory bandwidth?
I'm looking for a way to benchmark shared memory and L1/L2 cache. However, the benchmark results I found are very different depending on the source.
In this paper, Dissecting the NVIDIA Volta GPU ...
0
votes
1
answer
2k
views
Cuda misaligned address for a reused shared block memory
My kernel allocates shared memory for data storage, but an error is reported if I change the size of the shared memory; see the code attached.
#include <stdio.h>
#include <assert.h>
#define ...
0
votes
1
answer
717
views
Is there still shared mem bank conflict in nvidia cuda compute capability 7.0 and above?
If all threads in the same block access the same address, i.e. array[0],
for some old compute capability, there is a bank conflict.
But does this conflict still exist for the latest compute capabilities (i.e....
1
vote
2
answers
2k
views
Shared Memory's atomicAdd with int and float have different SASS
I encountered a performance issue where shared memory's atomicAdd on float is much more expensive than on int, after profiling with nv-nsight-cu-cli.
After checking the generated SASS, I found ...
0
votes
0
answers
398
views
docker run --gpus all with multiple volumes for yolov5
docker run --gpus all --shm-size=4gb --name bookfinalfinal --volume D:/BankDataDocker/data:/usr/src/app/data --volume D:/BankDataDocker/run:/usr/src/app/runs yolo:v1 bash
docker: Error response ...
1
vote
1
answer
976
views
Can a Cuda warp communicate with a different warp without using shared memory?
I have a kernel where each warp accumulates the sum of a chunk of data.
At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
1
vote
1
answer
4k
views
Why is the max amount of shared memory per block on Ampere GPUs not a multiple of 16 KiB?
Traditionally, NVIDIA GPUs have offered CUDA thread blocks shared memory in amounts always divisible by 16 KiB (see e.g. in this table). However, with Ampere 8.0 and 8.6 GPUs, the amounts are 99 KiB ...
0
votes
1
answer
146
views
Thread synchronization necessity on a `volatile __shared__` flag
My questions arise while reading the last example of B.5. Memory Fence Functions.
I understand the flag in this example checks the final block processing a sum.
In my imagination, if the flag is ...
1
vote
1
answer
960
views
Shared Memory Bank Conflicts in Parallel Reduction Algorithm
I was reading a slide-deck from Nvidia (Optimizing Parallel Reduction in CUDA) discussing a parallel reduction algorithm.
Slide 15:
Here the writer discusses that by using sequential addressing, we ...
0
votes
1
answer
224
views
racecheck error from a data structure in shared memory
I have a hash table data structure, which uses linear probing and is designed to be lock-free using CAS.
The hash table
constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;
struct OnceLock {...
2
votes
1
answer
465
views
CUDA parallel scan algorithm shared memory race condition
I'm reading the book "Programming Massively Parallel Processor" (3rd edition) that presents an implementation of the Kogge-Stone parallel scan algorithm.
This algorithm is meant to be run by ...
0
votes
0
answers
294
views
How can I tell whether a CUDA device has a fixed shared memory bank size?
Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes).
As I have come to realize, I won't get an error trying to set the ...
0
votes
1
answer
476
views
Why does my GPU refuse to accept a shared memory configuration without emitting an error?
Consider the program below.
It gets a CUDA device's current shared memory bank size configuration; sets it to another value; then gets it again. Unfortunately, this is the output:
The reported shared ...
3
votes
1
answer
721
views
CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error
I'm using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; ...
2
votes
1
answer
1k
views
CUDA memory bank conflict
I would like to be sure that I correctly understand bank conflicts in shared memory.
I have 32 segments of data.
These segments consist of 128 integers each.
[[0, 1, ..., 126, 127], [128, 129, ..., ...