
debug-cuda-crash

Tutorial for identifying and resolving CUDA runtime crashes using FlashInfer's API logging framework.

Introduction

This skill provides a comprehensive guide for developers and AI engineers to debug complex CUDA-related crashes when using the FlashInfer library. CUDA errors—such as illegal memory accesses, out-of-bounds reads/writes, NaN/Inf numerical instability, and out-of-memory (OOM) failures—are notoriously difficult to trace because they often terminate the execution environment immediately, leaving little context behind. This skill teaches users how to leverage the @flashinfer_api decorator to instrument code, enabling systematic visibility into tensor states and API call sequences prior to failure.
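The instrumentation pattern behind a decorator like @flashinfer_api can be sketched in plain Python. The sketch below is a hypothetical stand-in, not FlashInfer's implementation: the names `log_api` and `attention` are illustrative, and NumPy arrays stand in for CUDA tensors so the example is self-contained.

```python
import functools
import logging

import numpy as np

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("flashinfer.api")


def log_api(fn):
    """Log shape/dtype/device metadata for array-like args before the call.

    Hypothetical stand-in for an API-logging decorator such as
    @flashinfer_api; records input metadata so that the last log
    entry before a crash identifies the offending call.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for i, a in enumerate(args):
            if hasattr(a, "shape") and hasattr(a, "dtype"):
                device = getattr(a, "device", "cpu")  # numpy arrays live on CPU
                log.debug("%s arg%d shape=%s dtype=%s device=%s",
                          fn.__name__, i, tuple(a.shape), a.dtype, device)
        return fn(*args, **kwargs)
    return wrapper


@log_api
def attention(q, k):
    # Toy kernel stand-in: a plain matmul in place of an attention kernel.
    return q @ k.T


out = attention(np.ones((4, 8), dtype=np.float16),
                np.ones((4, 8), dtype=np.float16))
```

Because the metadata is logged before the wrapped function runs, the trailing log entries survive even when the kernel itself aborts the process.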

  • Capture input metadata, including tensor shapes, dtypes, device placement, and memory layout, before kernel execution.

  • Log internal tensor statistics, such as min, max, mean, and NaN/Inf counts, to pinpoint numerical instability in models.

  • Configure log levels and destinations, supporting both standard output and file-based logging for multi-process (e.g., torchrun) environments.

  • Identify structural issues like head dimension mismatches, incorrect data types (FP16 vs BF16), or CPU-to-GPU transfer errors.

  • Integrate with advanced external diagnostic tools like compute-sanitizer for deep-level hardware memory analysis.

  • Users should set environment variables like FLASHINFER_LOGLEVEL (ranging from 1 to 5) and FLASHINFER_LOGDEST to manage log verbosity and target output.

  • This guide assumes the user is working within a PyTorch-based inference pipeline using FlashInfer's optimized attention or GEMM kernels.

  • It is critical to use the %i pattern in the log destination path when performing multi-rank/multi-GPU debugging to prevent inter-process log corruption.

  • While the tool is optimized for catching runtime failures, it does not replace kernel-level profiling, which should be performed using the FlashInfer profiler tool for performance bottlenecks.

  • Ensure sufficient storage space if logging at Level 5 (tensor statistics) for high-frequency kernel calls to avoid disk I/O bottlenecks during execution.
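The Level-5 tensor statistics mentioned above (min, max, mean, NaN/Inf counts) amount to a few array reductions. The helper below is a hypothetical sketch of what such a log record might contain, written with NumPy for self-containment; it is not FlashInfer's actual logging code.

```python
import math

import numpy as np


def tensor_stats(x):
    """Summarize an array the way Level-5 tensor-statistics logging might.

    Reductions run over finite values only, so a single Inf does not
    mask the range of the rest of the tensor; NaN/Inf occurrences are
    counted separately, which is usually the first signal of
    numerical instability.
    """
    finite = x[np.isfinite(x)]
    return {
        "min": float(finite.min()) if finite.size else math.nan,
        "max": float(finite.max()) if finite.size else math.nan,
        "mean": float(finite.mean()) if finite.size else math.nan,
        "nan_count": int(np.isnan(x).sum()),
        "inf_count": int(np.isinf(x).sum()),
    }


x = np.array([1.0, 2.0, np.nan, np.inf, -np.inf])
stats = tensor_stats(x)
```

A nonzero `nan_count` in the last logged call before a crash narrows the search to that kernel's inputs rather than the whole pipeline.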
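Putting the configuration points together, a minimal sketch of setting the FLASHINFER_LOGLEVEL and FLASHINFER_LOGDEST variables from Python before the library is imported might look like the following; the exact accepted values and import-time behavior should be confirmed against FlashInfer's own documentation.

```python
import os

# Level 5 enables tensor statistics (per the guide); %i in the log
# destination expands per process rank so concurrent torchrun workers
# do not interleave writes into one file.
os.environ["FLASHINFER_LOGLEVEL"] = "5"
os.environ["FLASHINFER_LOGDEST"] = "/tmp/flashinfer_rank_%i.log"

# Set these BEFORE `import flashinfer`: logging configuration is
# typically read once at import time, so later changes may be ignored.
```

Setting the variables in the launching shell (or the torchrun command line) achieves the same effect and avoids any ordering concerns inside the script.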

Repository Stats

Stars: 5,537
Forks: 946
Open Issues: 587
Language: Python
Default Branch: main
Last Synced: Apr 30, 2026, 04:07 PM