If you are working on the CUDA code, here are some useful CUDA debugging tips:
slow down the build process by about 50% (compared to only `DEBUG=1`), so use wisely.
2. `cuda-gdb` and `cuda-memcheck` are your best CUDA debugging friends. Unlike `gdb`,
`cuda-gdb` can display actual values in a CUDA tensor (rather than all zeros).
3. CUDA supports many C++11 features, such as `std::numeric_limits`, `std::nextafter`,
and `std::tuple`, in device code. Many of these features are made possible by the
[`--expt-relaxed-constexpr`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#constexpr-functions)
nvcc flag. There is a known [issue](https://github.com/ROCm-Developer-Tools/HIP/issues/374)
where ROCm errors out on device code that uses such STL functions.
4. A good performance metric for a CUDA kernel is its
[Effective Memory Bandwidth](https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/).
It is useful to measure this metric whenever you are writing or optimizing a CUDA
kernel. The following script shows how to measure the effective bandwidth of the CUDA
`uniform_` kernel:
```python
import torch
import time

size = 128 * 512
nrep = 100
nbytes_read_write = 4  # bytes read + written per element by the kernel; change this to fit your kernel

for i in range(10):
    a = torch.Tensor(size).cuda().uniform_()
    torch.cuda.synchronize()
    # dry run to allocate memory and warm up
    out = a.uniform_()
    torch.cuda.synchronize()
    start = time.time()
    for j in range(nrep):
        out = a.uniform_()
    torch.cuda.synchronize()
    end = time.time()
    timec = (end - start) / nrep
    print("uniform, size, elements", size, "forward", timec,
          "bandwidth (GB/s)", size * nbytes_read_write * 1e-9 / timec)
    size *= 2
```
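
The bandwidth arithmetic in the script above can be factored into a small helper (a CPU-only sketch; the function name `effective_bandwidth_gbs` is ours, not a PyTorch API). Effective bandwidth is simply total bytes moved divided by elapsed time:

```python
def effective_bandwidth_gbs(num_elements, bytes_per_element, seconds):
    """Effective memory bandwidth in GB/s: (bytes read + written) / elapsed time."""
    total_bytes = num_elements * bytes_per_element
    return total_bytes * 1e-9 / seconds

# Example: writing 128 * 512 float32 values (4 bytes each) in 10 microseconds.
bw = effective_bandwidth_gbs(128 * 512, 4, 10e-6)
print(f"{bw:.2f} GB/s")  # 26.21 GB/s
```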
Hope this helps, and thanks for considering contributing.