
Commit 95eb933

syed-ahmed authored and facebook-github-bot committed
Adds CUDA C++11 and Profiling Notes (#21386)
Summary: Pull Request resolved: #21386
ghimport-source-id: 9430c76
Differential Revision: D15640102
Pulled By: ezyang
fbshipit-source-id: 98a5efdea9b1de05207ebd3624cb20acda9fe96b
1 parent eadac84 commit 95eb933

File tree

1 file changed: +33 -0 lines changed


CONTRIBUTING.md

Lines changed: 33 additions & 0 deletions
@@ -369,6 +369,39 @@ If you are working on the CUDA code, here are some useful CUDA debugging tips:
   slow down the build process by about 50% (compared to only `DEBUG=1`), so use it wisely.
2. `cuda-gdb` and `cuda-memcheck` are your best CUDA debugging friends. Unlike `gdb`,
   `cuda-gdb` can display actual values in a CUDA tensor (rather than all zeros).
3. CUDA supports many C++11 features, such as `std::numeric_limits`, `std::nextafter`,
   and `std::tuple`, in device code. Many of these features are made possible by the
   [--expt-relaxed-constexpr](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#constexpr-functions)
   nvcc flag (a minimal standalone sketch of this usage appears after this list). There is a known
   [issue](https://github.com/ROCm-Developer-Tools/HIP/issues/374) where ROCm errors out on
   device code that uses such STL functions.
4. A good performance metric for a CUDA kernel is the
   [Effective Memory Bandwidth](https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/).
   It is useful to measure this metric whenever you are writing or optimizing a CUDA
   kernel. The following script shows how to measure the effective bandwidth of the CUDA
   `uniform_` kernel.
   ```python
   import torch
   import time

   size = 128 * 512
   nrep = 100
   nbytes_read_write = 4  # number of bytes read + written by the kernel; change this to fit your kernel

   for _ in range(10):
       a = torch.Tensor(size).cuda().uniform_()
       torch.cuda.synchronize()
       # dry run to allocate memory and warm up the kernel
       out = a.uniform_()
       torch.cuda.synchronize()
       start = time.time()
       for _ in range(nrep):
           out = a.uniform_()
       torch.cuda.synchronize()
       end = time.time()
       timec = (end - start) / nrep
       print("uniform, size, elements", size, "forward", timec,
             "bandwidth (GB/s)", size * nbytes_read_write * 1e-9 / timec)
       size *= 2
   ```
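
As a standalone illustration of the `--expt-relaxed-constexpr` behavior mentioned in item 3, the sketch below (not taken from the PyTorch tree; the file name and kernel name are made up for illustration) calls `std::numeric_limits` and `std::nextafter` from device code. It is a minimal example, assuming a working CUDA toolchain, and would be compiled with something like `nvcc --expt-relaxed-constexpr demo.cu -o demo`.

```cuda
// demo.cu -- minimal hypothetical sketch, not part of PyTorch.
// std::numeric_limits members are constexpr host functions, so calling them
// from device code relies on the --expt-relaxed-constexpr nvcc flag above.
#include <cstdio>
#include <limits>
#include <cmath>

__global__ void limits_demo(float* out) {
  out[0] = std::numeric_limits<float>::lowest();  // most negative finite float
  out[1] = std::numeric_limits<float>::max();     // largest finite float
  out[2] = std::nextafter(1.0f, 2.0f);            // smallest float greater than 1.0f
}

int main() {
  float* d_out = nullptr;
  float h_out[3];
  cudaMalloc(&d_out, 3 * sizeof(float));
  limits_demo<<<1, 1>>>(d_out);
  cudaMemcpy(h_out, d_out, 3 * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("lowest=%g max=%g nextafter(1, 2)=%.9g\n", h_out[0], h_out[1], h_out[2]);
  cudaFree(d_out);
  return 0;
}
```

On ROCm, the same code may run into the compilation issue linked in item 3.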
Hope this helps, and thanks for considering contributing.
