Need help finding very hard to reproduce bug in an OpenCL program #5080
Hello! I am using ROCm 6.3.3 on openSUSE 15.5 with a Radeon Pro W7700.
This fault's attributes and status are the same every time it happens. Assuming this is a problem in one of my OpenCL kernels, what tools could I use to catch and investigate this very rarely occurring problem? Thank you!
Replies: 1 comment 16 replies
Hi @claudiubalogh, thanks for reaching out! Page faults that only show up after long runs can certainly be difficult to track down. Unfortunately, the log you provided didn't reveal anything specific. We suggest trying the following "tricks" to gather more information, which will hopefully give some further hints.
Hopefully, with these tools you can get more insight into what's going on in your program. Good luck :) Thanks!
Set HSAKMT_DEBUG_LEVEL to a value from 3 to 5: this enables a trace in HSA, the ROCm runtime layer that sits between OpenCL and the lower-level drivers. Higher levels reveal more information but also produce more log output. Since your workload runs for a long time, it is probably best to start with a lower value. Once you have an idea of what was happening around the time the page fault occurs, you can …
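For a run that lasts hours, one way to apply this suggestion without filling the disk is to keep only the last part of the trace. The following is a rough sketch, assuming the trace is written to the process's stderr and that only the lines leading up to the fault are of interest. The helper name `run_with_hsakmt_debug`, the `tail_lines` parameter, and the `./my_opencl_app` path are made up for illustration and are not part of ROCm.

```python
import collections
import os
import subprocess
import sys


def run_with_hsakmt_debug(cmd, level=3, tail_lines=2000):
    """Run `cmd` with HSAKMT_DEBUG_LEVEL set and keep only the log tail.

    Returns (exit_code, lines), where `lines` holds at most `tail_lines`
    of the most recent stderr lines. Assumption: the trace goes to stderr.
    """
    # Copy the current environment and add the debug variable for the child only.
    env = dict(os.environ)
    env["HSAKMT_DEBUG_LEVEL"] = str(level)

    # A bounded deque silently drops the oldest lines, so memory use stays
    # constant no matter how long the workload runs.
    tail = collections.deque(maxlen=tail_lines)

    proc = subprocess.Popen(
        cmd,
        env=env,
        stderr=subprocess.PIPE,  # capture the trace; stdout is left untouched
        text=True,
        errors="replace",        # tolerate non-UTF-8 bytes in the log
    )
    for line in proc.stderr:
        tail.append(line.rstrip("\n"))

    return proc.wait(), list(tail)


# Example call (hypothetical program path), starting at the lowest level:
#   code, lines = run_with_hsakmt_debug(["./my_opencl_app"], level=3)
#   if code != 0:
#       print("\n".join(lines), file=sys.stderr)
```

Starting at level 3 and raising it only once the fault has been caught keeps the overhead low during the long wait for a reproduction.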