56 questions
1
vote
1
answer
110
views
What is the actual maximum nesting depth of dynamic parallelism in CUDA?
Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...
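For context, a minimal host-plus-device sketch, assuming the legacy pre-CUDA-12 device runtime (documented hardware nesting ceiling of 24 levels, default synchronization depth of 2); the kernel body and chosen limits are illustrative:

#include <cuda_runtime.h>

__global__ void parentKernel(int depth) {
    if (depth >= 8) return;                 // stop before the configured sync depth
    parentKernel<<<1, 1>>>(depth + 1);      // device-side relaunch, one level deeper
}

int main() {
    // Depths beyond cudaLimitDevRuntimeSyncDepth cannot use device-side
    // synchronization; the hardware nesting limit itself is 24 levels.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
    // Reserve buffer space for outstanding child launches.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
    parentKernel<<<1, 1>>>(0);
    return cudaDeviceSynchronize() != cudaSuccess;
}

Compile with -rdc=true and link against -lcudadevrt.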
3
votes
1
answer
139
views
Does Clang support dynamic parallelism in CUDA?
Dynamic parallelism means kernels launching kernels. It's possible to compile a CUDA program using Clang, but does Clang support dynamic parallelism?
I am getting this error when attempting to compile a CUDA ...
3
votes
0
answers
100
views
CUDA Dynamic Parallelism Synchronization (CUDA > 11.6)
I need to use a function like cudaDeviceSynchronize to wait for a kernel to finish execution. However, after version 11.6, it is no longer possible to use any form of synchronization within device ...
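For reference, a minimal sketch of the CUDA 12 replacement pattern, assuming the named tail-launch stream (names illustrative): work launched into cudaStreamTailLaunch runs only after the entire parent grid finishes, so code that used to follow a device-side cudaDeviceSynchronize() moves into the tail-launched kernel.

__global__ void afterParent(float *buf) {
    buf[threadIdx.x] += 1.0f;               // observes all of the parent's writes
}

__global__ void parent(float *buf) {
    buf[threadIdx.x] = (float)threadIdx.x;  // work the child must see
    if (threadIdx.x == 0)
        afterParent<<<1, 256, 0, cudaStreamTailLaunch>>>(buf);
}
// host side: parent<<<1, 256>>>(d_buf);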
0
votes
1
answer
250
views
Can I get the device-side pointer to a CUDA kernel using its mangled symbol name?
Note: If the post seems long, jump directly to the section starting with "I was wondering.." at the end to skip the buildup/context.
Buildup/Context:
For the ...
0
votes
1
answer
199
views
How to ensure that a child kernel finished processing before the parent kernel continues?
I am learning CUDA and decided to do a basic image box blur demo as a way to get familiar.
When I try to make a child kernel compute the sum of the neighboring pixels, the sum is always 0.
After some ...
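A minimal sketch of the pattern this question is about, assuming a toolkit before CUDA 11.6 (where device-side cudaDeviceSynchronize() is still legal); names and the toy reduction are illustrative:

#include <cstdio>

__global__ void childSum(const int *px, int n, int *out) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += px[i]; // toy neighbor-pixel sum
    *out = s;
}

__global__ void parent(const int *px, int n, int *out) {
    childSum<<<1, 1>>>(px, n, out);
    cudaDeviceSynchronize();                // pre-11.6 only: wait for the child
    printf("sum = %d\n", *out);             // the child's global write is now visible
}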
3
votes
1
answer
185
views
CUDA dynamic parallelism -- Is there a way to infinitely nest kernel launches?
Now I'm using CUDA dynamic parallelism to launch a kernel from within a kernel function.
According to the CUDA documentation, kernel functions can only be launched to a fixed recursion depth because of resource constraints.
...
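A hedged sketch of the usual workaround, assuming the documented 24-level ceiling (names illustrative): carry a depth counter and switch to an in-kernel loop instead of nesting further once the limit nears.

__global__ void touch(int *data, int n, int depth) {
    if (depth >= 23 || n <= 1024) {
        for (int i = 0; i < n; ++i) data[i] += 1;  // leaf: plain loop, no nesting
        return;
    }
    int half = n / 2;
    touch<<<1, 1>>>(data, half, depth + 1);        // fire-and-forget children
    touch<<<1, 1>>>(data + half, n - half, depth + 1);
}
// host side: touch<<<1, 1>>>(d_data, n, 0); a host cudaDeviceSynchronize()
// waits for the whole launch tree.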
0
votes
1
answer
317
views
CUDA dynamic parallelism is computing sequentially
I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of the initial matrix E and multiplies them by some numbers calculated from v and t ...
0
votes
1
answer
469
views
How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?
So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a ...
0
votes
1
answer
2k
views
Can I copy files from Sharepoint to Azure Blob Storage using dynamic file path?
I am building a pipeline to copy files from Sharepoint to Azure Blob Storage at work.
After reading some documentation, I was able to create a pipeline that only copies certain files.
However, I would ...
0
votes
1
answer
441
views
CUDA dynamic parallelism: Access child kernel results in global memory
I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
...
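A minimal sketch of the usual fix, assuming a pre-allocated global buffer (names illustrative): a parent thread's local array lives in per-thread local memory, which a child grid cannot dereference, so values must be staged through global memory first.

__global__ void child(int *payloads) {
    payloads[threadIdx.x] *= 2;             // children may only touch global memory
}

__global__ void parent(int *g_payloads) {
    int aPayloads[32];                      // thread-local: invisible to children
    for (int i = 0; i < 32; ++i) aPayloads[i] = i;              // start values
    for (int i = 0; i < 32; ++i) g_payloads[i] = aPayloads[i];  // stage to global
    child<<<1, 32>>>(g_payloads);
}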
1
vote
1
answer
615
views
Why can't I link to my CUDA static library that uses Dynamic Parallelism and Separable Compilation?
I'm trying to create the most basic CUDA application to demonstrate Dynamic Parallelism, Separate Compilation and Linking, a CUDA kernel in a static library, and I'm trying to use CMake to generate a ...
0
votes
1
answer
691
views
Can a CUDA parent kernel launch a child kernel with more threads than the parent?
I'm trying to learn how to use CUDA Dynamic Parallelism.
I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with ...
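Yes: a child grid's dimensions are independent of the parent's. A minimal sketch (names illustrative):

__global__ void child(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i;
}

__global__ void parent(float *out, int n) {
    // A single parent thread may launch a far larger child grid.
    child<<<(n + 255) / 256, 256>>>(out, n);
}
// host side: parent<<<1, 1>>>(d_out, 1 << 20);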
0
votes
1
answer
2k
views
Why is cudaLaunchCooperativeKernel() returning not permitted?
So I am using a GTX 1050 with compute capability 6.1 and CUDA 11.0. I need to use grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked my device query so ...
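A hedged host-side sketch of the usual checks (kernel and sizes illustrative): cooperative launches fail unless the device reports cudaDevAttrCooperativeLaunch and every block of the grid can be resident at once.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void coopKernel(int *flag) { if (*flag) *flag = 0; }  // placeholder body

int main() {
    int dev = 0, supported = 0, smCount = 0, perSm = 0, blockSize = 128;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) { printf("cooperative launch unsupported\n"); return 1; }
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSm, coopKernel, blockSize, 0);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    int *d_flag;
    cudaMalloc(&d_flag, sizeof(int));
    void *args[] = { &d_flag };
    // The grid must not exceed what can be co-resident, or the launch is rejected.
    cudaLaunchCooperativeKernel((void *)coopKernel, dim3(perSm * smCount),
                                dim3(blockSize), args);
    return (int)cudaDeviceSynchronize();
}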
1
vote
1
answer
1k
views
How to call a Thrust function in a stream from a kernel?
I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be ...
0
votes
1
answer
200
views
Nvidia visual profiler not showing cudaMalloc() after kernel launch
I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel that is being launched from the host. I use Dynamic ...
7
votes
0
answers
202
views
AleaGPU Dynamic Parallelism in F#? How?
This might be a simple question, but I have not been able to find any references to this topic: How do I launch a kernel from within another kernel? The only relevant example I came across is the ...
0
votes
1
answer
491
views
Synchronizing depth of nested kernels
Let's take the following code, where there is a parent and a child kernel. From said parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then ...
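For reference, a minimal sketch of per-thread device-side streams (the device runtime only accepts cudaStreamNonBlocking; names illustrative):

__global__ void child(int parentTid) { /* per-child work */ }

__global__ void parent() {
    cudaStream_t s;
    // One stream per parent thread lets the children run concurrently
    // instead of serializing on the per-block NULL stream.
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 32, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);                   // already-submitted work still completes
}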
0
votes
1
answer
882
views
compile multiple CUDA files (that use dynamic parallelism) and MPI code
I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a make ...
1
vote
1
answer
982
views
Synchronization in CUDA dynamic parallelism
I am testing dynamic parallelism with the following kernel, which gets the maximum value of an integer array in a divide-and-conquer fashion:
__global__ void getMax(...
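A minimal sketch of one correct way to structure such a reduction, assuming n is a power of two and a pre-CUDA-11.6 toolkit (device-side cudaDeviceSynchronize; names illustrative): a single parent thread drives one fold per level, leaving the maximum in data[0].

__global__ void foldMax(int *data, int half) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) data[i] = max(data[i], data[i + half]);
}

__global__ void getMaxDriver(int *data, int n) {
    while (n > 1) {
        int half = n / 2;
        foldMax<<<(half + 255) / 256, 256>>>(data, half);
        cudaDeviceSynchronize();            // finish this level before the next
        n = half;
    }
}
// host side: getMaxDriver<<<1, 1>>>(d_data, n);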
3
votes
1
answer
195
views
Dynamic Parallelism on GTX 980 ti: Unknown Error
I am attempting dynamic parallelism on a GTX 980 ti card.
All attempts at running the code return "unknown error".
Simple code is shown below with compilation options.
I can execute kernels at depth=0 ...
7
votes
2
answers
6k
views
CUDA Dynamic Parallelism, bad performance
We are having performance issues when using the CUDA Dynamic Parallelism. At this moment, CDP is performing at least 3X slower than a traditional approach.
We made the simplest reproducible code to ...
0
votes
1
answer
1k
views
How can I synchronize device-side command queues with host-side queues? clFinish() and markerWithWaitList gives invalid queue error
I'm using the OpenCL 2.0 dynamic parallelism feature and have each work-item enqueue another kernel with a single work-item. When the child kernel's completion time is high, the parent kernel completes before ...
1
vote
1
answer
801
views
CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism
Kernel code that produces the error:
__kernel void testDynamic(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[1], 2);
}
__kernel void test(__global int *data)
{
    int ...
7
votes
1
answer
758
views
CUDA device runtime API cudaMemsetAsync doesn't work
I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"). But no matter what value I use, it always sets memory to 0.
Here is my test code:
#include "cuda_runtime.h"
#include ...
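For reference, a minimal device-side sketch plus the usual gotcha (names illustrative): as with host memset, the value parameter is a single byte, so values other than 0 or 0xFF rarely produce the int values people expect.

__global__ void parent(int *buf, int n) {
    // Sets every BYTE to 0x01: each int becomes 0x01010101, not 1.
    cudaMemsetAsync(buf, 0x01, n * sizeof(int), 0);
}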
0
votes
1
answer
401
views
Using shared memory in Dynamic Parallelism CUDA
Question 1:
Do I have to specify the amount of dynamic shared memory to be allocated at the launch of the parent kernel if shared memory is only used by the child kernel?
Question 2:
The following is my ...
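On Question 1, a minimal sketch of the usual arrangement (names illustrative; assumes n <= 1024): dynamic shared memory is sized per launch, so the child's allocation belongs in the child's own launch configuration and the parent reserves nothing.

__global__ void child(int n) {
    extern __shared__ int smem[];           // the child's own dynamic allocation
    smem[threadIdx.x] = threadIdx.x;
    __syncthreads();
}

__global__ void parent(int n) {
    child<<<1, n, n * sizeof(int)>>>(n);    // shared size set here, not at parent launch
}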
3
votes
1
answer
400
views
What factors affect the overhead of dynamic parallelism kernel launches?
When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What factors contribute to or affect the amount of this overhead? e.g. size of the kernel code, ...
0
votes
1
answer
1k
views
Dynamic parallelism - passing contents of shared memory to spawned blocks?
While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able ...
0
votes
1
answer
897
views
"device-function-maxrregcount" message while compiling cuda code
I am trying to write a code which performs multiple vector dot product inside the kernel. I'm using cublasSdot function from cublas library to perform vector dot product. This is my code:
using ...
-2
votes
1
answer
143
views
CUDA Dynamic Parallelism Dereferencing Global Memory
To test out dynamic parallelism, I wrote a simple code and compiled it on a GTX 1080 with the following commands.
nvcc -arch=sm_35 -dc dynamic_test.cu -o dynamic_test.o
nvcc -arch=sm_35 dynamic_test....
2
votes
1
answer
536
views
"unknown error" on first cudaMalloc if CUBLAS is present in kernel
I have the following minimal .cu file
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
#include <cstdio>
__global__ void test()
{
    cublasHandle_t handle = nullptr;
...
1
vote
3
answers
3k
views
Generating Relocatable Device Code using Nvidia Nsight
I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile it gives an error saying:
kernel launch from __device__ or __global__ functions requires separate compilation ...
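For context, the canonical single-step compile line for dynamic parallelism (file name and arch illustrative):

nvcc -arch=sm_35 -rdc=true kernel.cu -o kernel -lcudadevrt

In Nsight, the equivalent is enabling the relocatable device code option in the project's CUDA build settings.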
0
votes
1
answer
327
views
Trouble compiling/running CUDA code involving dynamic parallelism
I am trying to use dynamic parallelism with CUDA, but I cannot get past the compilation step.
I am working on a GPU with Compute Capability 3.5 and the CUDA version 7.5.
Depending on the switches ...
3
votes
1
answer
1k
views
How to perform relational join on two data containers on GPU (preferably CUDA)?
What I'm trying to do:
On the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I'm ...
0
votes
1
answer
173
views
Accessing CUDA built-in variable in child kernel
I'm trying to use Kepler's Dynamic Parallelism for one of my applications. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words,...
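A minimal sketch of the standard approach (names illustrative): inside the child, threadIdx and blockIdx describe the child's own launch, so the parent's global index has to travel as an ordinary kernel argument.

#include <cstdio>

__global__ void child(int parentGid) {
    printf("launched by parent thread %d\n", parentGid);
}

__global__ void parent() {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    child<<<1, 1>>>(gid);
}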
0
votes
1
answer
678
views
Cublas not working within kernel once compiled to cubin using -G flag with nvcc
I have a CUDA kernel that looks like the following:
#include <cublas_v2.h>
#include <math_constants.h>
#include <stdio.h>
extern "C" {
    __device__ float ONE = 1.0f;
...
1
vote
2
answers
2k
views
Nested Directives in OpenACC
I'm trying to use the nested feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla 40c and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile ...
0
votes
1
answer
75
views
Do kernel-launched child kernels have the same warp size as host-launched kernels?
When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs.
...
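A tiny sketch to check this empirically (warpSize is 32 on every NVIDIA GPU shipped to date, and child grids are ordinary grids, so both lines print 32):

#include <cstdio>

__global__ void child() {
    if (threadIdx.x == 0) printf("child warpSize = %d\n", warpSize);
}

__global__ void parent() {
    if (threadIdx.x == 0) {
        printf("parent warpSize = %d\n", warpSize);
        child<<<1, 32>>>();
    }
}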
1
vote
2
answers
2k
views
Understanding Dynamic Parallelism in CUDA
Example of dynamic parallelism:
__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid,...
1
vote
1
answer
2k
views
Is it possible to call cublas functions from a device function?
Here Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling with compute capability 3.5, I cannot manage to call cuBLAS ...
0
votes
1
answer
229
views
CUDA dynamic parallelism linkage error with extern "C"
I'm trying to link my Kepler Dynamic Parallelism CUDA program as follows:
nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu
nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -...
-1
votes
1
answer
1k
views
Dynamic Parallelism - separate compilation: undefined reference to __cudaRegisterLinkedBinary
Although I have followed appendix C, "Compiling Dynamic Parallelism", of the "CUDA Programming Guide" and the solutions given here, I cannot manage to solve the problem I have. After the compilation and ...
9
votes
1
answer
10k
views
Compiling .cu files with Dynamic Parallelism (CUDA)
I switched to a new GPU, a GeForce GTX 980 with cc 5.2, so it must support dynamic parallelism. However, I was not able to compile even a simple code (from the programming guide). I will not provide it here (...
3
votes
1
answer
1k
views
CUDA dynamic parallelism with Driver API
I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110.
I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, ...
1
vote
1
answer
2k
views
CUDA recursion depth
When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it, and my program doesn't work for inputs greater than blah.
My question is how many ...
2
votes
1
answer
2k
views
CUDA - How to make a thread in a kernel wait for its children
I'm trying to implement a really simple merge sort using CUDA's recursive launch (for cc >= 3.5) capability, but I cannot find a way to tell the parent thread to launch its children concurrently and then wait ...
1
vote
1
answer
1k
views
numba.typeinfer.TypingError: Untyped global name 'child_launch' when using CUDA Dynamic Parallelism in Python (Anaconda) on an NVIDIA GPU
My code is here:
import numpy as np
from numbapro import cuda

@cuda.autojit
def child_launch(data):
    data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100

@cuda.autojit
def parent_launch(data):
...
0
votes
1
answer
367
views
Dynamic Parallelism in CUDA not working
I wrote a simple code to understand Dynamic Parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values ...
2
votes
0
answers
212
views
Does nvcc support tail call optimization in dynamic parallelism?
In the CUDA Programming Guide, section C.4.3.1.2, "Nesting and Synchronization Depth", it is mentioned:
"An optimization is permitted where the system detects that it need not reserve space for ...
0
votes
1
answer
481
views
CUDA dynamic parallelism: invalid global write when using texture memory [closed]
I seem to have trouble when a kernel called within a kernel (even a recursive call) uses texture memory to get a value.
If the child kernel, say a different one, doesn't use texture memory, everything ...
0
votes
1
answer
1k
views
Nvidia Jetson TK1 Development Board - Cuda Compute Capability
I am quite impressed with this development kit. Instead of buying a new CUDA card, which might also require a new motherboard etc., this board seems to provide everything in one.
Its specs say it has CUDA ...