
debug-cuda-crash

Tutorial for identifying and resolving CUDA runtime crashes using FlashInfer's API logging framework.

Introduction

This skill provides a comprehensive guide for developers and AI engineers to debug complex CUDA-related crashes when using the FlashInfer library. CUDA errors—such as illegal memory accesses, out-of-bounds reads/writes, NaN/Inf numerical instability, and out-of-memory (OOM) failures—are notoriously difficult to trace because they often terminate the execution environment immediately, leaving little context behind. This skill teaches users how to leverage the @flashinfer_api decorator to instrument code, enabling systematic visibility into tensor states and API call sequences prior to failure.
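The instrumentation pattern behind a decorator like @flashinfer_api can be sketched in plain Python. The sketch below is a hypothetical stand-in, not FlashInfer's implementation: the names `log_api` and `attention` are illustrative, and NumPy arrays stand in for CUDA tensors so the example is self-contained.

```python
import functools
import logging

import numpy as np

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("flashinfer.api")


def log_api(fn):
    """Log shape/dtype/device metadata for array-like args before the call.

    Hypothetical stand-in for an API-logging decorator such as
    @flashinfer_api; records input metadata so that the last log
    entry before a crash identifies the offending call.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for i, a in enumerate(args):
            if hasattr(a, "shape") and hasattr(a, "dtype"):
                device = getattr(a, "device", "cpu")  # numpy arrays live on CPU
                log.debug("%s arg%d shape=%s dtype=%s device=%s",
                          fn.__name__, i, tuple(a.shape), a.dtype, device)
        return fn(*args, **kwargs)
    return wrapper


@log_api
def attention(q, k):
    # Toy kernel stand-in: a plain matmul in place of an attention kernel.
    return q @ k.T


out = attention(np.ones((4, 8), dtype=np.float16),
                np.ones((4, 8), dtype=np.float16))
```

Because the metadata is logged before the wrapped function runs, the trailing log entries survive even when the kernel itself aborts the process.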

  • Capture input metadata, including tensor shapes, dtypes, device placement, and memory layout, before kernel execution.

  • Log internal tensor statistics, such as min, max, mean, and NaN/Inf counts, to pinpoint numerical instability in models.

  • Configure log levels and destinations, supporting both standard output and file-based logging for multi-process (e.g., torchrun) environments.

  • Identify structural issues like head dimension mismatches, incorrect data types (FP16 vs BF16), or CPU-to-GPU transfer errors.

  • Integrate with advanced external diagnostic tools like compute-sanitizer for deep-level hardware memory analysis.

  • Users should set environment variables like FLASHINFER_LOGLEVEL (ranging from 1 to 5) and FLASHINFER_LOGDEST to manage log verbosity and target output.

  • This guide assumes the user is working within a PyTorch-based inference pipeline using FlashInfer's optimized attention or GEMM kernels.

  • It is critical to use the %i pattern in the log destination path when performing multi-rank/multi-GPU debugging to prevent inter-process log corruption.

  • While the tool is optimized for catching runtime failures, it does not replace kernel-level profiling, which should be performed using the FlashInfer profiler tool for performance bottlenecks.

  • Ensure sufficient storage space if logging at Level 5 (tensor statistics) for high-frequency kernel calls to avoid disk I/O bottlenecks during execution.
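The Level-5 tensor statistics mentioned above (min, max, mean, NaN/Inf counts) amount to a few array reductions. The helper below is a hypothetical sketch of what such a log record might contain, written with NumPy for self-containment; it is not FlashInfer's actual logging code.

```python
import math

import numpy as np


def tensor_stats(x):
    """Summarize an array the way Level-5 tensor-statistics logging might.

    Reductions run over finite values only, so a single Inf does not
    mask the range of the rest of the tensor; NaN/Inf occurrences are
    counted separately, which is usually the first signal of
    numerical instability.
    """
    finite = x[np.isfinite(x)]
    return {
        "min": float(finite.min()) if finite.size else math.nan,
        "max": float(finite.max()) if finite.size else math.nan,
        "mean": float(finite.mean()) if finite.size else math.nan,
        "nan_count": int(np.isnan(x).sum()),
        "inf_count": int(np.isinf(x).sum()),
    }


x = np.array([1.0, 2.0, np.nan, np.inf, -np.inf])
stats = tensor_stats(x)
```

A nonzero `nan_count` in the last logged call before a crash narrows the search to that kernel's inputs rather than the whole pipeline.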
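Putting the configuration points together, a minimal sketch of setting the FLASHINFER_LOGLEVEL and FLASHINFER_LOGDEST variables from Python before the library is imported might look like the following; the exact accepted values and import-time behavior should be confirmed against FlashInfer's own documentation.

```python
import os

# Level 5 enables tensor statistics (per the guide); %i in the log
# destination expands per process rank so concurrent torchrun workers
# do not interleave writes into one file.
os.environ["FLASHINFER_LOGLEVEL"] = "5"
os.environ["FLASHINFER_LOGDEST"] = "/tmp/flashinfer_rank_%i.log"

# Set these BEFORE `import flashinfer`: logging configuration is
# typically read once at import time, so later changes may be ignored.
```

Setting the variables in the launching shell (or the torchrun command line) achieves the same effect and avoids any ordering concerns inside the script.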

Repository Stats

Stars: 5,537
Forks: 946
Open Issues: 587
Language: Python
Default Branch: main
Last Synced: Apr 30, 2026, 04:07 PM