56 questions
1
vote
1
answer
110
views
What is the actual maximum nesting depth of dynamic parallelism in CUDA?
Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...
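For context, a minimal host-plus-device sketch, assuming the legacy pre-CUDA-12 device runtime (documented hardware nesting ceiling of 24 levels, default synchronization depth of 2); the kernel body and chosen limits are illustrative:

#include <cuda_runtime.h>

__global__ void parentKernel(int depth) {
    if (depth >= 8) return;                 // stop before the configured sync depth
    parentKernel<<<1, 1>>>(depth + 1);      // device-side relaunch, one level deeper
}

int main() {
    // Depths beyond cudaLimitDevRuntimeSyncDepth cannot use device-side
    // synchronization; the hardware nesting limit itself is 24 levels.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
    // Reserve buffer space for outstanding child launches.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
    parentKernel<<<1, 1>>>(0);
    return cudaDeviceSynchronize() != cudaSuccess;
}

Compile with -rdc=true and link against -lcudadevrt.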
3
votes
1
answer
139
views
Does Clang support dynamic parallelism in CUDA?
Dynamic parallelism means kernels launching kernels. It's possible to compile a CUDA program using Clang, but does Clang support dynamic parallelism?
I am getting this error when attempting to compile a CUDA ...
3
votes
0
answers
100
views
CUDA Dynamic Parallelism Synchronization (CUDA > 11.6)
I need to use a function like cudaDeviceSynchronize to wait for a kernel to finish execution. However, after version 11.6, it is no longer possible to use any form of synchronization within device ...
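For reference, a minimal sketch of the CUDA 12 replacement pattern, assuming the named tail-launch stream (names illustrative): work launched into cudaStreamTailLaunch runs only after the entire parent grid finishes, so code that used to follow a device-side cudaDeviceSynchronize() moves into the tail-launched kernel.

__global__ void afterParent(float *buf) {
    buf[threadIdx.x] += 1.0f;               // observes all of the parent's writes
}

__global__ void parent(float *buf) {
    buf[threadIdx.x] = (float)threadIdx.x;  // work the child must see
    if (threadIdx.x == 0)
        afterParent<<<1, 256, 0, cudaStreamTailLaunch>>>(buf);
}
// host side: parent<<<1, 256>>>(d_buf);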
0
votes
1
answer
250
views
Can I get the device-side pointer to a CUDA kernel using its mangled symbol name?
Note: If the post seems long, jump directly to the section starting with "I was wondering.." at the end to skip the buildup/context.
Buildup/Context:
For the ...
0
votes
1
answer
199
views
How to ensure that a child kernel finished processing before the parent kernel continues?
I am learning CUDA and decided to do a basic image box blur demo as a way to get familiar.
When I try to make a child kernel compute the sum of the neighboring pixels, the sum is always 0.
After some ...
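A minimal sketch of the pattern this question is about, assuming a toolkit before CUDA 11.6 (where device-side cudaDeviceSynchronize() is still legal); names and the toy reduction are illustrative:

#include <cstdio>

__global__ void childSum(const int *px, int n, int *out) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += px[i]; // toy neighbor-pixel sum
    *out = s;
}

__global__ void parent(const int *px, int n, int *out) {
    childSum<<<1, 1>>>(px, n, out);
    cudaDeviceSynchronize();                // pre-11.6 only: wait for the child
    printf("sum = %d\n", *out);             // the child's global write is now visible
}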
3
votes
1
answer
185
views
CUDA dynamic parallelism -- Is there a way to infinitely nest kernel launches?
Now I'm using CUDA dynamic parallelism to launch a kernel from within a kernel function.
According to the CUDA documentation, kernel functions can only be launched to a fixed recursion depth because of resource constraints.
...
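A hedged sketch of the usual workaround, assuming the documented 24-level ceiling (names illustrative): carry a depth counter and switch to an in-kernel loop instead of nesting further once the limit nears.

__global__ void touch(int *data, int n, int depth) {
    if (depth >= 23 || n <= 1024) {
        for (int i = 0; i < n; ++i) data[i] += 1;  // leaf: plain loop, no nesting
        return;
    }
    int half = n / 2;
    touch<<<1, 1>>>(data, half, depth + 1);        // fire-and-forget children
    touch<<<1, 1>>>(data + half, n - half, depth + 1);
}
// host side: touch<<<1, 1>>>(d_data, n, 0); a host cudaDeviceSynchronize()
// waits for the whole launch tree.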
0
votes
1
answer
317
views
CUDA dynamic parallelism is computing sequentially
I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of the initial matrix E and multiplies them by some numbers calculated from v and t ...
0
votes
1
answer
469
views
How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?
So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a ...
0
votes
1
answer
2k
views
Can I copy files from Sharepoint to Azure Blob Storage using dynamic file path?
I am building a pipeline to copy files from Sharepoint to Azure Blob Storage at work.
After reading some documentation, I was able to create a pipeline that only copies certain files.
However, I would ...
0
votes
1
answer
441
views
CUDA dynamic parallelism: Access child kernel results in global memory
I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
...
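A minimal sketch of the usual fix, assuming a pre-allocated global buffer (names illustrative): a parent thread's local array lives in per-thread local memory, which a child grid cannot dereference, so values must be staged through global memory first.

__global__ void child(int *payloads) {
    payloads[threadIdx.x] *= 2;             // children may only touch global memory
}

__global__ void parent(int *g_payloads) {
    int aPayloads[32];                      // thread-local: invisible to children
    for (int i = 0; i < 32; ++i) aPayloads[i] = i;              // start values
    for (int i = 0; i < 32; ++i) g_payloads[i] = aPayloads[i];  // stage to global
    child<<<1, 32>>>(g_payloads);
}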
1
vote
1
answer
615
views
Why can't I link to my CUDA static library that uses Dynamic Parallelism and Separable Compilation?
I'm trying to create the most basic CUDA application to demonstrate Dynamic Parallelism, Separate Compilation and Linking, a CUDA kernel in a static library, and I'm trying to use CMake to generate a ...
0
votes
1
answer
691
views
Can a CUDA parent kernel launch a child kernel with more threads than the parent?
I'm trying to learn how to use CUDA Dynamic Parallelism.
I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with ...
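Yes: a child grid's dimensions are independent of the parent's. A minimal sketch (names illustrative):

__global__ void child(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i;
}

__global__ void parent(float *out, int n) {
    // A single parent thread may launch a far larger child grid.
    child<<<(n + 255) / 256, 256>>>(out, n);
}
// host side: parent<<<1, 1>>>(d_out, 1 << 20);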
0
votes
1
answer
2k
views
Why is cudaLaunchCooperativeKernel() returning not permitted?
So I am using a GTX 1050 with compute capability 6.1 and CUDA 11.0. I need to use grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked my device query so ...
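A hedged host-side sketch of the usual checks (kernel and sizes illustrative): cooperative launches fail unless the device reports cudaDevAttrCooperativeLaunch and every block of the grid can be resident at once.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void coopKernel(int *flag) { if (*flag) *flag = 0; }  // placeholder body

int main() {
    int dev = 0, supported = 0, smCount = 0, perSm = 0, blockSize = 128;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) { printf("cooperative launch unsupported\n"); return 1; }
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSm, coopKernel, blockSize, 0);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    int *d_flag;
    cudaMalloc(&d_flag, sizeof(int));
    void *args[] = { &d_flag };
    // The grid must not exceed what can be co-resident, or the launch is rejected.
    cudaLaunchCooperativeKernel((void *)coopKernel, dim3(perSm * smCount),
                                dim3(blockSize), args);
    return (int)cudaDeviceSynchronize();
}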
1
vote
1
answer
1k
views
How to call a Thrust function in a stream from a kernel?
I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be ...
0
votes
1
answer
200
views
Nvidia visual profiler not showing cudaMalloc() after kernel launch
I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel that is being launched from the host. I use Dynamic ...
7
votes
0
answers
202
views
AleaGPU Dynamic Parallelism in F#? How?
This might be a simple question, but I have not been able to find any references to this topic: How do I launch a kernel from within another kernel? The only relevant example I came across is the ...
0
votes
1
answer
491
views
Synchronizing depth of nested kernels
Let's take the following code, where there is a parent and a child kernel. From said parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then ...
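For reference, a minimal sketch of per-thread device-side streams (the device runtime only accepts cudaStreamNonBlocking; names illustrative):

__global__ void child(int parentTid) { /* per-child work */ }

__global__ void parent() {
    cudaStream_t s;
    // One stream per parent thread lets the children run concurrently
    // instead of serializing on the per-block NULL stream.
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 32, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);                   // already-submitted work still completes
}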
0
votes
1
answer
882
views
compile multiple CUDA files (that use dynamic parallelism) and MPI code
I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a make ...
1
vote
1
answer
982
views
Synchronization in CUDA dynamic parallelism
I am testing dynamic parallelism with the following kernel, which gets the maximum value of an integer array in a divide-and-conquer fashion:
__global__ void getMax(...
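A minimal sketch of one correct way to structure such a reduction, assuming n is a power of two and a pre-CUDA-11.6 toolkit (device-side cudaDeviceSynchronize; names illustrative): a single parent thread drives one fold per level, leaving the maximum in data[0].

__global__ void foldMax(int *data, int half) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) data[i] = max(data[i], data[i + half]);
}

__global__ void getMaxDriver(int *data, int n) {
    while (n > 1) {
        int half = n / 2;
        foldMax<<<(half + 255) / 256, 256>>>(data, half);
        cudaDeviceSynchronize();            // finish this level before the next
        n = half;
    }
}
// host side: getMaxDriver<<<1, 1>>>(d_data, n);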
3
votes
1
answer
195
views
Dynamic Parallelism on GTX 980 ti: Unknown Error
I am attempting dynamic parallelism on a GTX 980 ti card.
All attempts at running the code return "unknown error".
Simple code is shown below with compilation options.
I can execute kernels at depth=0 ...
7
votes
2
answers
6k
views
CUDA Dynamic Parallelism, bad performance
We are having performance issues when using the CUDA Dynamic Parallelism. At this moment, CDP is performing at least 3X slower than a traditional approach.
We made the simplest reproducible code to ...
0
votes
1
answer
1k
views
How can I synchronize device-side command queues with host-side queues? clFinish() and markerWithWaitList gives invalid queue error
I'm using the OpenCL 2.0 dynamic parallelism feature and have each work-item enqueue another kernel with a single work-item. When the child kernel's completion time is high, the parent kernel completes before ...
1
vote
1
answer
801
views
CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism
Kernel code that produces the error:
__kernel void testDynamic(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[1], 2);
}
__kernel void test(__global int *data)
{
    int ...
7
votes
1
answer
758
views
CUDA device runtime API cudaMemsetAsync doesn't work
I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"). But no matter what value I use, it always sets memory to 0.
Here is my test code:
#include "cuda_runtime.h"
#include ...
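For reference, a minimal device-side sketch plus the usual gotcha (names illustrative): as with host memset, the value parameter is a single byte, so values other than 0 or 0xFF rarely produce the int values people expect.

__global__ void parent(int *buf, int n) {
    // Sets every BYTE to 0x01: each int becomes 0x01010101, not 1.
    cudaMemsetAsync(buf, 0x01, n * sizeof(int), 0);
}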
0
votes
1
answer
401
views
Using shared memory in Dynamic Parallelism CUDA
Question 1:
Do I have to specify the amount of dynamic shared memory to be allocated at the launch of the parent kernel if shared memory is only used by the child kernel?
Question 2:
The following is my ...
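On Question 1, a minimal sketch of the usual arrangement (names illustrative; assumes n <= 1024): dynamic shared memory is sized per launch, so the child's allocation belongs in the child's own launch configuration and the parent reserves nothing.

__global__ void child(int n) {
    extern __shared__ int smem[];           // the child's own dynamic allocation
    smem[threadIdx.x] = threadIdx.x;
    __syncthreads();
}

__global__ void parent(int n) {
    child<<<1, n, n * sizeof(int)>>>(n);    // shared size set here, not at parent launch
}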
3
votes
1
answer
400
views
What factors affect the overhead of dynamic parallelism kernel launches?
When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What factors contribute to or affect the amount of this overhead? e.g. size of the kernel code, ...
0
votes
1
answer
1k
views
Dynamic parallelism - passing contents of shared memory to spawned blocks?
While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able ...
0
votes
1
answer
897
views
"device-function-maxrregcount" message while compiling cuda code
I am trying to write a code which performs multiple vector dot product inside the kernel. I'm using cublasSdot function from cublas library to perform vector dot product. This is my code:
using ...
-2
votes
1
answer
143
views
CUDA Dynamic Parallelism Dereferencing Global Memory
To test out dynamic parallelism, I wrote a simple code and compiled it on a GTX 1080 with the following commands.
nvcc -arch=sm_35 -dc dynamic_test.cu -o dynamic_test.o
nvcc -arch=sm_35 dynamic_test....
2
votes
1
answer
536
views
"unknown error" on first cudaMalloc if CUBLAS is present in kernel
I have the following minimal .cu file
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
#include <cstdio>
__global__ void test()
{
    cublasHandle_t handle = nullptr;
...
1
vote
3
answers
3k
views
Generating Relocatable Device Code using Nvidia Nsight
I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile it gives an error saying:
kernel launch from __device__ or __global__ functions requires separate compilation ...
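For context, the canonical single-step compile line for dynamic parallelism (file name and arch illustrative):

nvcc -arch=sm_35 -rdc=true kernel.cu -o kernel -lcudadevrt

In Nsight, the equivalent is enabling the relocatable device code option in the project's CUDA build settings.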
0
votes
1
answer
327
views
Trouble compiling/running CUDA code involving dynamic parallelism
I am trying to use dynamic parallelism with CUDA, but I cannot get past the compilation step.
I am working on a GPU with Compute Capability 3.5 and the CUDA version 7.5.
Depending on the switches ...
3
votes
1
answer
1k
views
How to perform relational join on two data containers on GPU (preferably CUDA)?
What I'm trying to do:
On the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I'm ...
0
votes
1
answer
173
views
Accessing CUDA built-in variable in child kernel
I'm trying to use Kepler's Dynamic Parallelism for one of my applications. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words,...
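A minimal sketch of the standard approach (names illustrative): inside the child, threadIdx and blockIdx describe the child's own launch, so the parent's global index has to travel as an ordinary kernel argument.

#include <cstdio>

__global__ void child(int parentGid) {
    printf("launched by parent thread %d\n", parentGid);
}

__global__ void parent() {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    child<<<1, 1>>>(gid);
}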
0
votes
1
answer
678
views
Cublas not working within kernel once compiled to cubin using -G flag with nvcc
I have a CUDA kernel that looks like the following:
#include <cublas_v2.h>
#include <math_constants.h>
#include <stdio.h>
extern "C" {
    __device__ float ONE = 1.0f;
...
1
vote
2
answers
2k
views
Nested Directives in OpenACC
I'm trying to use the nested feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla 40c and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile ...
0
votes
1
answer
75
views
Do kernel-launched child kernels have the same warp size as host-launched kernels?
When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs.
...
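A tiny sketch to check this empirically (warpSize is 32 on every NVIDIA GPU shipped to date, and child grids are ordinary grids, so both lines print 32):

#include <cstdio>

__global__ void child() {
    if (threadIdx.x == 0) printf("child warpSize = %d\n", warpSize);
}

__global__ void parent() {
    if (threadIdx.x == 0) {
        printf("parent warpSize = %d\n", warpSize);
        child<<<1, 32>>>();
    }
}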
1
vote
2
answers
2k
views
Understanding Dynamic Parallelism in CUDA
Example of dynamic parallelism:
__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid,...
1
vote
1
answer
2k
views
Is it possible to call cublas functions from a device function?
Here Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling with compute capability 3.5, I cannot manage to call cuBLAS ...
0
votes
1
answer
229
views
CUDA dynamic parallelism linkage error with extern "C"
I'm trying to link my Kepler Dynamic Parallelism CUDA program as follows:
nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu
nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -...
-1
votes
1
answer
1k
views
Dynamic Parallelism - separate compilation: undefined reference to __cudaRegisterLinkedBinary
Although I have followed appendix C, "Compiling Dynamic Parallelism", of the "CUDA Programming Guide" and the solutions given here, I cannot manage to solve the problem I have. After the compilation and ...
9
votes
1
answer
10k
views
Compiling .cu files with Dynamic Parallelism (CUDA)
I switched to a new GPU, a GeForce GTX 980 with cc 5.2, so it must support dynamic parallelism. However, I was not able to compile even a simple code (from the programming guide). I will not provide it here (...
3
votes
1
answer
1k
views
CUDA dynamic parallelism with Driver API
I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110.
I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, ...
1
vote
1
answer
2k
views
CUDA recursion depth
When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it, and my program doesn't work for inputs greater than blah.
My question is how many ...
2
votes
1
answer
2k
views
CUDA - How to make a thread in a kernel wait for its children
I'm trying to implement a really simple merge sort using CUDA's recursive launch (for cc >= 3.5) capability, but I cannot find a way to tell the parent thread to launch its children concurrently and then wait ...
1
vote
1
answer
1k
views
numba.typeinfer.TypingError: Untyped global name 'child_launch' when using CUDA Dynamic Parallelism in Python (Anaconda) on an NVIDIA GPU
My code is here:
import numpy as np
from numbapro import cuda

@cuda.autojit
def child_launch(data):
    data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100

@cuda.autojit
def parent_launch(data):
...
0
votes
1
answer
367
views
Dynamic Parallelism in CUDA not working
I wrote a simple code to understand Dynamic Parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values ...
2
votes
0
answers
212
views
Does nvcc support tail call optimization in dynamic parallelism?
In the CUDA Programming Guide, section C.4.3.1.2, "Nesting and Synchronization Depth", it is mentioned:
"An optimization is permitted where the system detects that it need not reserve space for ...
0
votes
1
answer
481
views
CUDA dynamic parallelism: invalid global write when using texture memory [closed]
I seem to have trouble when a kernel called within a kernel (even a recursive call) uses texture memory to get a value.
If the child kernel, say a different one, doesn't use texture memory, everything ...
0
votes
1
answer
1k
views
Nvidia Jetson TK1 Development Board - Cuda Compute Capability
I am quite impressed with this development kit. Instead of buying a new CUDA card, which might also require a new motherboard etc., this board seems to provide everything in one.
Its specs say it has CUDA ...