How to debug "CUDA error: device-side assert triggered"?

owen

New member
hit the dreaded 'device-side assert triggered' runtime error. code runs completely fine on cpu but crashes on gpu. how do i actually trace this?
 
Set CUDA_LAUNCH_BLOCKING=1 in the environment before running your script. That makes CUDA kernel launches synchronous, so the error is raised at the call that actually failed and the stack trace points to the right line instead of some unrelated later one.
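If you can't easily change how the script is launched (notebook, IDE, etc.), a rough sketch of setting the variable from inside Python is below. It assumes the variable is set before anything initializes CUDA, so the safest spot is the very top of the file, before the framework import.

```python
import os

# Must be in the environment before CUDA is initialized,
# so set it at the very top of the script or notebook.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import your GPU framework only after the line above, e.g.:
# import torch
```

In a notebook you will probably need to restart the kernel first, since CUDA may already be initialized in the running session.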
 

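CUDA device-side assert triggered usually means a kernel received an invalid value, most often an out-of-range index such as a class label outside [0, num_classes) or an embedding index past the vocabulary size. Debug by enabling CUDA_LAUNCH_BLOCKING=1, checking tensor shapes, making sure class indices are valid, checking for and removing NaNs, and running the model on CPU, where the same bug usually raises a clearer error.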
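A minimal sketch of the label and NaN checks, in plain Python so it runs anywhere. The helper names find_bad_labels and has_non_finite are made up for illustration, and it assumes you have already pulled the data off the GPU into plain lists (for a PyTorch tensor, something like labels.cpu().tolist()).

```python
import math

def find_bad_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < num_classes]

def has_non_finite(values):
    """True if any value is NaN or +/-inf."""
    return any(not math.isfinite(v) for v in values)

# Hypothetical batch: 3 classes, so valid labels are 0, 1, 2.
labels = [0, 2, 3, -1]
print(find_bad_labels(labels, num_classes=3))
print(has_non_finite([0.5, float("nan")]))
```

Running this on each batch right before the loss call is usually enough to catch the offending sample, and it works even when the GPU error message is unhelpful.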

 

yeah, CUDA_LAUNCH_BLOCKING=1 is the first thing to try for these. learned this trick the hard way lol. glad someone mentioned it so early in the thread, will save owen some pain.
 