1 vote · 1 answer · 171 views
If I do a memcpy_async on a per-thread basis, everything works fine; see the test_memcpy32 below. This code prefetches data within a single warp. I want to expand this, so that I can prefetch data in ...
Johan • 77.4k
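For context, a hedged sketch of the warp-scoped variant this question is heading toward, using cooperative_groups::memcpy_async (CUDA 11+). The kernel name and TILE size are illustrative, not taken from the question:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    constexpr int TILE = 32;   // one tile per block of 32 threads (a single warp)

    __global__ void prefetch_per_warp(const int *src, int *dst, int n) {
        __shared__ int smem[TILE];
        auto warp = cg::tiled_partition<32>(cg::this_thread_block());

        // All lanes cooperate in one asynchronous copy instead of each
        // thread issuing its own memcpy_async.
        cg::memcpy_async(warp, smem, src + blockIdx.x * TILE, sizeof(int) * TILE);
        cg::wait(warp);   // block until the staged tile has arrived

        int i = blockIdx.x * TILE + warp.thread_rank();
        if (i < n) dst[i] = smem[warp.thread_rank()];
    }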
2 votes · 1 answer · 90 views
I am trying to understand this article: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda More specifically, bank conflicts are what I am ...
user8469759 • 2,948
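For readers skimming this one: the padding idiom that chapter introduces can be sketched as below (the chapter's own macro names, shown here for 32-bank hardware; the article itself targets 16 banks):

    #define NUM_BANKS 32
    #define LOG_NUM_BANKS 5
    // One padding element per NUM_BANKS elements, so the scan's strided
    // tree accesses land in distinct banks.
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    // Inside the scan kernel every shared-memory index gets the offset:
    //   ai += CONFLICT_FREE_OFFSET(ai);
    //   bi += CONFLICT_FREE_OFFSET(bi);
    //   temp[bi] += temp[ai];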
0 votes · 1 answer · 202 views
I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
pettersson
2 votes · 0 answers · 86 views
Is there an officially sanctioned way to reuse shared data between global functions? Consider the following code https://cuda.godbolt.org/z/KMj9EKKbf: #include <cuda.h> #include <stdio.h> ...
Johan • 77.4k
-3 votes · 1 answer · 131 views
I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts: a two-way bank conflict and a three-way bank conflict. The figure below shows how the words are mapped ...
kdh • 194
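To make the two-way case concrete, a minimal sketch (assuming 32 banks of 4-byte words and a 32-thread launch; not the book's exact figure):

    __global__ void twoWayConflict(const float *in, float *out) {
        __shared__ float s[64];
        s[threadIdx.x] = in[threadIdx.x];
        s[threadIdx.x + 32] = in[threadIdx.x + 32];
        __syncthreads();
        // Stride-2 word access: lanes t and t+16 hit the same bank, so the
        // warp's request is replayed twice -- a two-way bank conflict.
        out[threadIdx.x] = s[threadIdx.x * 2];
    }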
0 votes · 1 answer · 55 views
I am developing a CUDA kernel that takes as input an array input of unsigned char of size n * 57, where n is a kernel argument. Nsight Compute reports that the memory accesses are not coalesced, ...
ric • 101
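One common remedy, sketched under assumptions (one 57-byte record per thread; names are illustrative; bounds checks for a partial last block omitted): stage the block's bytes into shared memory with coalesced loads, then read each record from fast on-chip memory.

    __global__ void process57(const unsigned char *input, int n) {
        extern __shared__ unsigned char s[];            // blockDim.x * 57 bytes
        size_t base = (size_t)blockIdx.x * blockDim.x * 57;
        // Coalesced: consecutive threads load consecutive bytes.
        for (int i = threadIdx.x; i < blockDim.x * 57; i += blockDim.x)
            s[i] = input[base + i];
        __syncthreads();
        const unsigned char *record = &s[threadIdx.x * 57];
        // ... per-record work on `record` ...
    }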
4 votes · 1 answer · 135 views
I have been trying to understand how data broadcasting works. To that end, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
log0xFF • 43
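For reference, the two conflict-free read shapes such kernel pairs usually contrast, sketched for a single 32-thread warp:

    __global__ void broadcastDemo(const float *in, float *out) {
        __shared__ float s[32];
        s[threadIdx.x] = in[threadIdx.x];
        __syncthreads();
        float a = s[0];            // all lanes read one word: a broadcast
        float b = s[threadIdx.x];  // one word per lane, one bank each: no conflict
        out[threadIdx.x] = a + b;
    }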
1 vote · 1 answer · 68 views
The problem I am having is that the matrixMulKernel_tiled kernel function is not performing any summing. I am unsure why the output matrix is consistently all zeros. Any thoughts? Please note, ...
Maayan Israel
0 votes · 1 answer · 139 views
Consider the following CUDA kernel that is used in Python via CuPy, from the CuPy docs: add_kernel = cp.RawKernel(r''' extern "C" __global__ void my_add(const float* x1, const float* x2, float*...
Uwe.Schneider
0 votes · 2 answers · 164 views
I would like to write a CUDA kernel that uses two different dynamically allocated shared memory arrays, in the following form: __global__ myKernel() { extern __shared__ int localSum1[]; extern ...
Uwe.Schneider
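Since a kernel gets exactly one dynamic shared-memory allocation, the usual answer is a single extern array carved up by hand. A minimal sketch, assuming element counts n1 and n2 supplied by the caller (align the second pointer manually if the element types differ):

    __global__ void myKernel(int n1, int n2) {
        extern __shared__ int smem[];
        int *localSum1 = smem;        // first n1 ints
        int *localSum2 = smem + n1;   // next n2 ints
        // ... use localSum1[0..n1) and localSum2[0..n2) ...
    }

    // launch with the combined size:
    //   myKernel<<<grid, block, (n1 + n2) * sizeof(int)>>>(n1, n2);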
0 votes · 1 answer · 161 views
I've queried the CUDA device (T1000, SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.) #include <cuda.h> #include <...
sof • 9,767
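For reference, the runtime-API shape of such a query, sketched for one attribute (the driver-API cuDeviceGetAttribute works analogously):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int smemPerBlock = 0;
        cudaDeviceGetAttribute(&smemPerBlock,
                               cudaDevAttrMaxSharedMemoryPerBlock, /*device=*/0);
        std::printf("max shared memory per block: %d bytes\n", smemPerBlock);
        return 0;
    }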
1 vote · 0 answers · 274 views
I am trying to incrementally optimize a matrix transpose operation in CUDA and gain some hands-on experience. I have tried a few things, but the timing measurements I am getting do not make sense. ...
Saydon • 27
3 votes · 0 answers · 385 views
I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer using CUDA. In my code, each CUDA thread is responsible for computing a single ...
Sunjnn • 51
1 vote · 1 answer · 386 views
After just upgrading to CUDA 12.4 and recompiling my project, I got the following warning: Experiments.cu:188:39: warning: ‘cudaError_t cudaFuncSetSharedMemConfig(T*, cudaSharedMemConfig) [with T = ...
Serge Rogatch
0 votes · 1 answer · 376 views
I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is, of course, fast. However, in image filtering, ...
MeiH • 1,875
4 votes · 1 answer · 180 views
I have a fairly large object that I want to load into shared memory so that multiple warps can access the object, something like struct alignas(8) Foo{ int a; float b; vec2 c; uvec2 d; ...
Krupip • 5,356
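A common pattern for this, sketched (the struct is abridged, with built-in float2/uint2 standing in for vec2/uvec2): the whole block copies the object word by word, then a barrier makes it safe for every warp.

    struct alignas(8) Foo { int a; float b; float2 c; uint2 d; };

    __global__ void useFoo(const Foo *g) {
        __shared__ Foo sfoo;
        auto *dst = reinterpret_cast<int *>(&sfoo);
        auto *src = reinterpret_cast<const int *>(g);
        for (unsigned i = threadIdx.x; i < sizeof(Foo) / sizeof(int); i += blockDim.x)
            dst[i] = src[i];
        __syncthreads();   // every warp may now read sfoo
        // ... use sfoo ...
    }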
0 votes · 1 answer · 132 views
We are using dynamic shared memory in our CUDA kernels. We are setting the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
msedi • 1,815
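For reference, the attribute involved, sketched in both APIs (fn, kernel, and bytes are placeholders; opting in this way is required before launching with more than 48 KiB of dynamic shared memory):

    // Driver API:
    cuFuncSetAttribute(fn, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, bytes);
    // Runtime API equivalent:
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);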
2 votes · 0 answers · 186 views
I know that in CUDA programming, memory reads at different levels can overlap. For example, data transfers from global memory to shared memory and from shared memory to registers can overlap. But can read ...
Y. Chen • 51
2 votes · 0 answers · 173 views
I expect a CUDA shared memory bank conflict in the following two situations: accessing successive half floats (2 words) with successive threads, and accessing vectorized int4 datatypes by successive ...
fabian • 1,881
0 votes · 1 answer · 135 views
I sometimes see the following shared memory declaration in CUDA kernels, and I am not sure what it means: extern __shared__ T shmem[][SZ] with SZ being a compile-time constant. The kernel is launched ...
fabian • 1,881
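A sketch of how that declaration behaves (T and the kernel name are illustrative): only the innermost extent SZ must be a compile-time constant; the outer extent is implied by the dynamic size passed at launch.

    #define SZ 32

    __global__ void k(int rows) {
        extern __shared__ float shmem[][SZ];   // 2D view over the dynamic allocation
        if ((int)threadIdx.y < rows)
            shmem[threadIdx.y][threadIdx.x] = 0.0f;   // ordinary 2D indexing
    }

    // launch with rows * SZ * sizeof(float) dynamic bytes:
    //   k<<<grid, dim3(SZ, rows), rows * SZ * sizeof(float)>>>(rows);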
0 votes · 0 answers · 77 views
The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N), using a shared memory region with dimensions tiledim ...
Athanasios Margaris
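For comparison, a minimal tiled multiply in the shape the question describes (square tiledim x tiledim staging tiles; N assumed to be a multiple of tiledim, bounds checks omitted):

    template <int tiledim>
    __global__ void matmul_tiled(const float *matA, const float *matB,
                                 float *matC, int N) {
        __shared__ float tA[tiledim][tiledim];
        __shared__ float tB[tiledim][tiledim];
        int row = blockIdx.y * tiledim + threadIdx.y;
        int col = blockIdx.x * tiledim + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N / tiledim; ++t) {
            tA[threadIdx.y][threadIdx.x] = matA[row * N + t * tiledim + threadIdx.x];
            tB[threadIdx.y][threadIdx.x] = matB[(t * tiledim + threadIdx.y) * N + col];
            __syncthreads();                 // tile fully staged
            for (int k = 0; k < tiledim; ++k)
                acc += tA[threadIdx.y][k] * tB[k][threadIdx.x];
            __syncthreads();                 // done reading before the next load
        }
        matC[row * N + col] = acc;
    }
    // launch: matmul_tiled<32><<<dim3(N/32, N/32), dim3(32, 32)>>>(A, B, C, N);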
0 votes · 1 answer · 367 views
I assume that, in the context of data fetches for a CUDA core, registers are the fastest, then shared memory, then the L1 cache, then the L2 cache, and global memory is the slowest. I assume that in a GPU data moves in ...
user2166888
1 vote · 0 answers · 154 views
I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3 is the baseline shared memory GEMM; gemm4 uses fewer thread blocks in the x dimension; gemm5 uses fewer blocks in both ...
Worldbuffer
1 vote · 0 answers · 106 views
Following the trick described here, I tested the following code and got the corresponding profiling result. Conflicts were notably diminished, but some still persist. // store conflict __global__ void ...
picklesmithy129
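For contrast, the padded shape that usually finishes the job, sketched on a 32x32 transpose tile (one extra column shifts each row by one bank, so the column-wise reads stop conflicting; launch with 32x32 blocks):

    __global__ void transpose32(const float *in, float *out, int n) {
        __shared__ float tile[32][33];         // 33 = 32 + 1 padding column
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;     // swapped block coordinates
        y = blockIdx.x * 32 + threadIdx.y;
        if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }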
0 votes · 1 answer · 73 views
I am trying to transpose a matrix. It works as expected for some values and starts crashing with bigger ones, or even between executions of the program. What I am trying to do is split the matrix ...
BrightSoul
2 votes · 0 answers · 107 views
I am writing a particle-based code in which every particle's neighbor list is already generated with the initial condition and unchanged during the whole simulation, but the positions of each ...
Sangjun Lee
8 votes · 1 answer · 7k views
From the CUDA Programming Guide: [Warp shuffle functions] exchange a variable between threads within a warp. I understand that this is an alternative to shared memory, thus it's being used for ...
gonidelis • 1,125
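The canonical example of that use, sketched: a warp-wide sum that never touches shared memory.

    __device__ float warpSum(float v) {
        // After log2(32) = 5 shuffle steps, lane 0 holds the sum of all lanes.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }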
1 vote · 1 answer · 1k views
I have written a simple CUDA program to perform array reduction using thread block clusters and distributed shared memory. I am compiling it with CUDA 12.0 and running on a Hopper GPU. Below is the ...
Ricky Dev
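For orientation, a hedged minimal shape of such a program (Hopper, CUDA 12+; a two-block cluster in which block 0 reads block 1's shared memory via map_shared_rank -- the toy values are assumptions, not the question's code):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void __cluster_dims__(2, 1, 1) clusterReduce(int *out) {
        __shared__ int partial;
        auto cluster = cg::this_cluster();
        if (threadIdx.x == 0) partial = blockIdx.x + 1;   // toy per-block result
        cluster.sync();                  // peers' shared memory now valid
        if (cluster.block_rank() == 0 && threadIdx.x == 0) {
            int *peer = cluster.map_shared_rank(&partial, 1);
            out[0] = partial + *peer;    // read the remote block's shared memory
        }
        cluster.sync();                  // keep it alive until everyone is done
    }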
0 votes · 1 answer · 135 views
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to benchmark a warp (32 threads) accessing 32 integers of 32 bits each in the following 2 ...
Ferdinand Mom
-2 votes · 1 answer · 92 views
I had a bug in my CUDA kernel code. I managed to locate and fix the bug, but it would be nice to know exactly what caused it. I am saving two numbers (p_x, p_y) [3D array XY sizes] in a ...
adnan • 1
7 votes · 3 answers · 3k views
I am currently studying CUDA and have learned that there are global memory and shared memory. I have checked the CUDA documentation and found that GPUs can access shared memory and global memory using ld.shared/...
Tae • 125
1 vote · 1 answer · 3k views
I'm currently learning how to write fast CUDA kernels. I implemented a tiled matrix multiplication (block size 32x32) which only does coalesced reads/writes from/to global memory and has no bank ...
chuber • 11
2 votes · 0 answers · 644 views
I faced a problem when training stylegan3 where the terminal gets stuck at "Setting up PyTorch plugin "upfirdn2d_plugin"...". I have tried all the methods I found, such as reinstalling ...
Fu Wenjin
2 votes · 1 answer · 465 views
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA. On the one hand, it seems that any load/store is compiled to ...
Pierre T. • 388
0 votes · 1 answer · 2k views
In the CUDA programming guide, in the shared memory section, it states that shared memory access by the warp is not serialized but broadcast for reads. However, it doesn't state what happens if the ...
Niteya Shah • 1,824
2 votes · 0 answers · 1k views
I'm looking for a way to benchmark shared memory and the L1/L2 caches. However, the benchmark results I found are very different depending on the source. In the paper Dissecting the NVIDIA Volta GPU ...
Huy Le • 1,999
0 votes · 1 answer · 2k views
My kernel allocates shared memory for data storage, but a bug is reported if I change the size of the shared memory; see the code attached. #include <stdio.h> #include <assert.h> #define ...
Mangoccc
0 votes · 1 answer · 717 views
If all threads in the same block access the same address, i.e. array[0], there is a bank conflict for some old compute capabilities. But does this conflict still exist for the latest compute capabilities (i.e....
cctv • 19
1 vote · 2 answers · 2k views
I encountered a performance issue where shared memory atomicAdd on float is much more expensive than on int, after profiling with nv-nsight-cu-cli. After checking the generated SASS, I found ...
JGL • 158
0 votes · 0 answers · 398 views
docker run --gpus all --shm-size=4gb --name bookfinalfinal --volume D:/BankDataDocker/data:/usr/src/app/data --volume D:/BankDataDocker/run:/usr/src/app/runs yolo:v1 bash
docker: Error response ...
JuniorScholar
1 vote · 1 answer · 976 views
I have a kernel where each warp accumulates the sum of a chunk of data. At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
Elad Maimoni • 4,791
1 vote · 1 answer · 4k views
Traditionally, NVIDIA GPUs have offered CUDA thread blocks shared memory in amounts always divisible by 16 KiB (see e.g. this table). However, with Ampere 8.0 and 8.6 GPUs, the amounts are 99 KiB ...
einpoklum • 138k
0 votes · 1 answer · 146 views
My questions arise from reading the last example in B.5 Memory Fence Functions. I understand that the flag in this example checks for the final block processing a sum. As I imagine it, if the flag is ...
JGL • 158
1 vote · 1 answer · 960 views
I was reading a slide deck from NVIDIA (Optimizing Parallel Reduction in CUDA) discussing a parallel reduction algorithm. On slide 15, the author explains that by using sequential addressing, we ...
Reza Namvar
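That slide's kernel, sketched for reference (sequential addressing: the active threads stay a contiguous prefix while the stride halves, so the shared-memory accesses are conflict-free):

    __global__ void reduceSequential(const float *g_in, float *g_out) {
        extern __shared__ float sdata[];
        unsigned tid = threadIdx.x;
        sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) g_out[blockIdx.x] = sdata[0];
    }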
0 votes · 1 answer · 224 views
I have a hash table data structure that uses linear probing and is designed to be lock-free with CAS. The hash table: constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff; struct OnceLock {...
JGL • 158
2 votes · 1 answer · 465 views
I'm reading the book "Programming Massively Parallel Processors" (3rd edition), which presents an implementation of the Kogge-Stone parallel scan algorithm. This algorithm is meant to be run by ...
Damiano Massarelli
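The per-block Kogge-Stone scan, sketched from its usual presentation (SECTION_SIZE equal to the block size assumed; the extra barrier-plus-register step avoids the read/write race on XY):

    #define SECTION_SIZE 256

    __global__ void koggeStoneScan(const float *in, float *out, int n) {
        __shared__ float XY[SECTION_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        XY[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            float v = 0.0f;
            if (threadIdx.x >= stride) v = XY[threadIdx.x - stride];
            __syncthreads();
            if (threadIdx.x >= stride) XY[threadIdx.x] += v;
        }
        if (i < n) out[i] = XY[threadIdx.x];
    }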
0 votes · 0 answers · 294 views
Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes). As I have come to realize, I won't get an error trying to set the ...
einpoklum • 138k
0 votes · 1 answer · 476 views
Consider the program below. It gets a CUDA device's current shared memory bank size configuration; sets it to another value; then gets it again. Unfortunately, this is the output: The reported shared ...
einpoklum • 138k
3 votes · 1 answer · 721 views
I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; // declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; ...
Yichen • 101
2 votes · 1 answer · 1k views
I would like to be sure that I correctly understand bank conflicts in shared memory. I have 32 segments of data. These segments consist of 128 integers each. [[0, 1, ..., 126, 127], [128, 129, ..., ...
Piotr K.
