1 vote
1 answer
110 views

Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...
StefanoTrv
3 votes
1 answer
139 views

Dynamic parallelism means kernels launching other kernels. It's possible to compile a CUDA program using clang, but does clang support dynamic parallelism? I am getting this error when attempting to compile a CUDA ...
michael101
3 votes
0 answers
100 views

I need to use a function like cudaDeviceSynchronize to wait for a kernel to finish execution. However, after version 11.6 it is no longer possible to use any form of synchronization within device ...
ug0x01
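For context on the question above: the CUDA 12 device runtime removed device-side cudaDeviceSynchronize(), and the documented replacement is to put follow-up work in the named tail-launch stream, which runs only after the parent grid and all of its children have completed. A minimal sketch under those assumptions (requires a CUDA 12+ toolkit and `-rdc=true`; kernel names are hypothetical):

```cuda
#include <cstdio>

__global__ void child(int *out) { out[threadIdx.x] *= 2; }

// Tail-launched: guaranteed to run only after the parent grid
// and every kernel the parent launched have finished.
__global__ void finish(int *out) { printf("out[0] = %d\n", out[0]); }

__global__ void parent(int *out) {
    if (threadIdx.x == 0) {
        child<<<1, 32>>>(out);
        // cudaStreamTailLaunch replaces the removed pattern
        // "launch child; cudaDeviceSynchronize(); use results".
        finish<<<1, 1, 0, cudaStreamTailLaunch>>>(out);
    }
}
```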
0 votes
1 answer
250 views

Note: if the post seems long, you can jump directly to the section starting with "I was wondering.." at the end to skip the buildup/context. Buildup/Context: For the ...
Abhishek Ghosh
0 votes
1 answer
199 views

I am learning CUDA and decided to do a basic image box blur demo as a way to get familiar. When I try to make a child kernel compute the sum of the neighboring pixels, the sum is always 0. After some ...
Amr Emad
3 votes
1 answer
185 views

Now I'm using CUDA dynamic parallelism to launch a kernel from inside another kernel function. According to the CUDA documentation, kernel functions can only be launched to a fixed recursion depth because of resource constraints. ...
Frostmourne
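The depth and launch-count limits that questions like this one run into are adjustable from the host before the first kernel launch, via the real cudaDeviceSetLimit knobs. A sketch (the chosen values are illustrative, not recommendations):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Maximum nesting depth at which a parent may synchronize on its
    // children (legacy device-side sync; the default is 2).
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
    // How many child launches may be outstanding at once; beyond this,
    // device-side launches start failing (the default is 2048).
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

    size_t depth = 0, pending = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth %zu, pending launches %zu\n", depth, pending);
    return 0;
}
```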
0 votes
1 answer
317 views

I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of the initial matrix E and multiplies them by some numbers calculated from v and t ...
Daniil Tarpanov
0 votes
1 answer
469 views

So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a ...
yugi957
0 votes
1 answer
2k views

I am building a pipeline to copy files from Sharepoint to Azure Blob Storage at work. After reading some documentation, I was able to create a pipeline that only copies certain files. However, I would ...
0 votes
1 answer
441 views

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this: int aPayloads[32]; // Compute aPayloads start values here ...
Silicomancer
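A common pitfall in exactly the pattern this excerpt shows: a parent kernel's local array (like `int aPayloads[32]`) lives in thread-local memory, and per the CUDA programming guide a pointer to local memory must not be passed to a child kernel — children may only dereference global memory. A hedged sketch of the usual fix, using the device-side heap (kernel and variable names are hypothetical):

```cuda
__global__ void childKernel(int *payloads) {
    payloads[threadIdx.x] += 1;   // legal: payloads points to global memory
}

__global__ void parentKernel() {
    // int aPayloads[32];  // thread-local memory: a pointer to this must
    //                     // NOT be passed to a child kernel.
    int *payloads = (int *)malloc(32 * sizeof(int));  // device heap = global
    if (payloads == nullptr) return;
    for (int i = 0; i < 32; ++i) payloads[i] = i;     // compute start values
    childKernel<<<1, 32>>>(payloads);
    // Don't free here: the child may not have run yet. Free from work
    // that is ordered after the child (e.g. a tail-launched kernel).
}
```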
1 vote
1 answer
615 views

I'm trying to create the most basic CUDA application to demonstrate Dynamic Parallelism, Separate Compilation and Linking, a CUDA kernel in a static library, and I'm trying to use CMake to generate a ...
Justin
0 votes
1 answer
691 views

I'm trying to learn how to use CUDA Dynamic Parallelism. I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with ...
Justin
0 votes
1 answer
2k views

So I am using a GTX 1050 (compute capability 6.1) with CUDA 11.0. I need to use grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked my device query, so ...
abhishekpurandare1297
1 vote
1 answer
1k views

I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be ...
heapoverflow
0 votes
1 answer
200 views

I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel that is being launched from the host. I use Dynamic ...
progammer
7 votes
0 answers
202 views

This might be a simple question, but I have not been able to find any references to this topic: how do I launch a kernel from within another kernel? The only relevant example I came across is the ...
Alex Gheith
0 votes
1 answer
491 views

Let's take the following code, where there is a parent and a child kernel. From said parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then ...
user2255757
0 votes
1 answer
882 views

I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a make ...
user2330963
1 vote
1 answer
982 views

I am testing dynamic parallelism with the following kernel, the one that gets the maximum value of an integer array using dynamic parallelism in a divide and conquer fashion: __global__ void getMax(...
Matias Haeussler
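For reference, the usual shape of a divide-and-conquer getMax with dynamic parallelism: each launch pairs element i with element i + half, halving the problem until the maximum sits in data[0]. This is a sketch of the general pattern, not the asker's exact code, and it assumes a single-block launch (n/2 <= blockDim.x):

```cuda
__global__ void getMax(int *data, int n) {
    int i = threadIdx.x;
    int half = (n + 1) / 2;                 // split point (handles odd n)
    if (i < n / 2 && data[i + half] > data[i])
        data[i] = data[i + half];           // keep the larger of each pair
    __syncthreads();                        // child must see every pair's result
    if (i == 0 && half > 1)
        getMax<<<1, (half + 1) / 2>>>(data, half);  // recurse on the front half
}
```

The __syncthreads() before the recursive launch matters: without it, thread 0 can launch the child before the other threads have written their winners, which is one common source of wrong results in this pattern.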
3 votes
1 answer
195 views

I am attempting dynamic parallelism on a GTX 980 ti card. All attempts at running code return "unknown error". Simple code is shown below with compilation options. I can execute kernels at depth=0 ...
AshleyG
7 votes
2 answers
6k views

We are having performance issues when using the CUDA Dynamic Parallelism. At this moment, CDP is performing at least 3X slower than a traditional approach. We made the simplest reproducible code to ...
Cristobal Navarro
0 votes
1 answer
1k views

I'm using the OpenCL 2.0 dynamic parallelism feature and have each workitem enqueue another kernel with a single workitem. When the completion time of the child kernel is high, the parent kernel completes before ...
huseyin tugrul buyukisik
1 vote
1 answer
801 views

Kernel codes that produce the error: __kernel void testDynamic(__global int *data) { int id=get_global_id(0); atomic_add(&data[1],2); } __kernel void test(__global int * data) { int ...
huseyin tugrul buyukisik
7 votes
1 answer
758 views

I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"). But no matter what value I use, it always sets memory to 0. Here is my test code: #include "cuda_runtime.h" #include ...
Xiang Zhang
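One likely explanation for the "always 0" symptom, worth noting next to this question: like host-side cudaMemset, cudaMemsetAsync writes a single byte value (only the low byte of the int argument is used), and a device-side memset is asynchronous with respect to the rest of the parent kernel. A sketch of both effects (buffer name hypothetical):

```cuda
__global__ void memsetDemo(int *buf, int n) {
    // Sets each BYTE to 0x01, so every int becomes 0x01010101 (16843009),
    // not 1. A value like 256 has a low byte of 0, so the ints read back
    // as 0 — plausibly the behavior described in the question.
    cudaMemsetAsync(buf, 1, n * sizeof(int));
    // The memset is only enqueued here; reading buf in this kernel right
    // after the call races with it. Order the read after this grid's
    // child work completes instead.
}
```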
0 votes
1 answer
401 views

Question 1: Do I have to specify the amount of dynamic shared memory to be allocated at the launch of parent kernel if shared memory is only used by child kernel. Question 2: The following is my ...
Aliya Clark
3 votes
1 answer
400 views

When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What are the factors contributing or affecting the amount of this overhead? e.g. size of the kernel code, ...
einpoklum
0 votes
1 answer
1k views

While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able ...
einpoklum
0 votes
1 answer
897 views

I am trying to write code which performs multiple vector dot products inside the kernel. I'm using the cublasSdot function from the cuBLAS library to perform the vector dot products. This is my code: using ...
starrr
-2 votes
1 answer
143 views

To test out dynamic parallelism, I wrote a simple code and compiled it on GTX1080 with the following commands. nvcc -arch=sm_35 -dc dynamic_test.cu -o dynamic_test.o nvcc -arch=sm_35 dynamic_test....
JYC
2 votes
1 answer
536 views

I have the following minimal .cu file #include <cuda_runtime_api.h> #include <cublas_v2.h> #include <cstdio> __global__ void test() { cublasHandle_t handle = nullptr; ...
Joe
1 vote
3 answers
3k views

I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile it gives an error saying: kernel launch from __device__ or __global__ functions requires separate compilation ...
BAdhi
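The error quoted in this excerpt is the standard one when relocatable device code is disabled. A hedged sketch of the usual nvcc invocations (file names hypothetical; the architecture flag should match your GPU, with sm_35 as the minimum for dynamic parallelism):

```shell
# Compile with relocatable device code, then link; any kernel that
# launches another kernel needs -rdc=true plus the device runtime library.
nvcc -arch=sm_35 -rdc=true -c app.cu -o app.o
nvcc -arch=sm_35 app.o -o app -lcudadevrt

# Or as a single step:
nvcc -arch=sm_35 -rdc=true app.cu -o app -lcudadevrt
```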
0 votes
1 answer
327 views

I am trying to use dynamic parallelism with CUDA, but I cannot get through the compilation step. I am working on a GPU with compute capability 3.5 and CUDA version 7.5. Depending on the switches ...
VincentN
3 votes
1 answer
1k views

What I'm trying to do: On the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I'm ...
aiwyn
0 votes
1 answer
173 views

I'm trying to use Kepler's Dynamic Parallelism for one of my application. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words,...
user3813674
0 votes
1 answer
678 views

I have a CUDA kernel that looks like the following: #include <cublas_v2.h> #include <math_constants.h> #include <stdio.h> extern "C" { __device__ float ONE = 1.0f; ...
Bam4d
1 vote
2 answers
2k views

I'm trying to use the nested feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla 40c, and my OpenACC compiler is PGI version 15.7. My code is very simple. When I try to compile ...
grypp
0 votes
1 answer
75 views

When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs. ...
mmdanziger
1 vote
2 answers
2k views

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,...
John
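The excerpt's printf string is broken across the truncation; a cleaned-up sketch consistent with this well-known recursive hello-world pattern (each recursion level launches half as many threads):

```cuda
#include <cstdio>

__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n",
           iDepth, tid, blockIdx.x);
    if (iSize == 1) return;        // stop when a single thread remains
    int nthreads = iSize >> 1;     // halve the thread count each level
    if (tid == 0 && nthreads > 0)
        nestedHelloWorld<<<1, nthreads>>>(nthreads, iDepth + 1);
}
```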
1 vote
1 answer
2k views

Here, Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling with compute capability 3.5, I cannot manage to call cuBLAS ...
emartel
0 votes
1 answer
229 views

I'm trying to link my CUDA Kepler's Dynamic Parallelism program as follows: nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -...
andersonbp
-1 votes
1 answer
1k views

Although I have followed appendix C, "Compiling Dynamic Parallelism", of the CUDA Programming Guide and the solutions given here, I cannot manage to solve the problem I have. After the compilation and ...
emartel
9 votes
1 answer
10k views

I switched to a new GPU, a GeForce GTX 980 with cc 5.2, so it must support dynamic parallelism. However, I was not able to compile even simple code (from the programming guide). I will not provide it here (...
Mikhail Genkin
3 votes
1 answer
1k views

I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110. I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, ...
FHoenig
1 vote
1 answer
2k views

When using dynamic parallelism in CUDA, you can implement recursive algorithms like merge sort. I have implemented it, and my program doesn't work for inputs greater than blah. My question is how many ...
AmirSojoodi
2 votes
1 answer
2k views

I'm trying to implement a really simple merge sort using CUDA recursion (for compute capability ≥ 3.5), but I cannot find a way to tell the parent thread to launch its children concurrently and then wait ...
Eugênio Fonseca
1 vote
1 answer
1k views

My code is here: import numpy as np from numbapro import cuda @cuda.autojit def child_launch(data): data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100 @cuda.autojit def parent_launch(data): ...
Ethan Huang
0 votes
1 answer
367 views

I wrote a simple program to understand Dynamic Parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values ...
Jagannath
2 votes
0 answers
212 views

Under the CUDA Programming Guide section C.4.3.1.2. "Nesting and Synchronization Depth", it is mentioned: "An optimization is permitted where the system detects that it need not reserve space for ...
peteraldaron
0 votes
1 answer
481 views

I seem to have trouble when a kernel call within a kernel (even a recursive call) uses texture memory to get a value. If the child kernel, say a different one, doesn't use texture memory, everything ...
salvaS
0 votes
1 answer
1k views

I am quite impressed with this development kit. Instead of buying a new CUDA card, which might require a new mainboard etc., this card seems to provide it all in one. Its specs say it has CUDA ...
phoad