debug-distributed

Debugging guide for AReaL distributed training issues, including hangs, NCCL errors, OOM, and numerical consistency in FSDP2/TP/CP/EP.

Introduction

This skill provides a systematic framework for diagnosing and resolving complex distributed training issues within the AReaL infrastructure. It is designed for machine learning engineers and researchers working with large-scale model training, specifically utilizing technologies like FSDP2, Tensor Parallelism (TP), Context Parallelism (CP), and Expert Parallelism (EP). The guide helps users isolate failures when training processes hang, deadlock, produce diverging results across ranks, or encounter CUDA out-of-memory (OOM) and NCCL communication errors.

  • Principles for minimal reproduction: Techniques for isolating failing operations with small tensors and reduced world sizes to speed up root-cause analysis.
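
A minimal reproduction can often be a single collective on a tiny tensor, launched at a reduced world size. A sketch of such a script (the gloo backend, tensor size, and rendezvous defaults below are illustrative choices, not AReaL defaults):

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK/WORLD_SIZE; default to a single-process run so
    # the script also works standalone.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")

    # gloo keeps the repro CPU-only; switch to "nccl" to reproduce GPU issues.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Smallest tensor that still exercises the suspect collective.
    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t)  # the operation under suspicion, isolated from training
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=2 repro.py`, this exercises the same collective path as the full job with far fewer moving parts.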

  • Hang and deadlock resolution: Detailed steps for analyzing process stalls, including the use of environment variables like TORCH_DISTRIBUTED_DEBUG and NCCL_DEBUG_SUBSYS, and utilizing py-spy to generate flame graphs and dump call stacks from hung processes.
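
For example, verbose logging can be enabled before initialization, and a hung process can then be inspected from outside with py-spy (the values below are common starting points, not AReaL-prescribed settings):

```python
import os

# Verbose collective logging; must be set before torch.distributed initializes.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"   # log per-collective details on each rank
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL,INIT"      # focus on collectives and setup

# A hung rank can then be inspected without stopping it, e.g.:
#   py-spy dump --pid <PID> --native
# which prints every thread's current call stack.
```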

  • Numerical consistency diagnostics: Tools for validating DTensor placements, checking gradient reduction across ranks, and identifying mismatches in collectives or process groups.
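
One way to check gradient reduction is to gather each parameter's gradient norm from every rank and compare. A hypothetical helper, assuming plain local-tensor gradients (DTensor gradients would first need to be converted to local shards):

```python
import torch
import torch.distributed as dist


def check_grad_consistency(model: torch.nn.Module, atol: float = 1e-6) -> bool:
    """Verify each parameter's gradient norm matches across ranks.

    After gradient reduction, every rank should hold identical (or
    identically sharded) gradients, so the norms should agree.
    """
    world_size = dist.get_world_size()
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        local = torch.tensor([p.grad.float().norm().item()])
        gathered = [torch.zeros_like(local) for _ in range(world_size)]
        dist.all_gather(gathered, local)  # collect every rank's norm
        norms = torch.cat(gathered)
        if (norms - norms[0]).abs().max() > atol:
            if dist.get_rank() == 0:
                print(f"grad mismatch on {name}: {norms.tolist()}")
            return False
    return True
```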

  • Memory optimization and OOM management: Methods for monitoring CUDA memory usage and verifying FSDP sharding coverage to prevent OOM errors.
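
A lightweight way to monitor memory is to snapshot allocator statistics at phase boundaries (before/after forward, backward, optimizer step). `log_cuda_memory` below is a hypothetical helper built on PyTorch's standard allocator counters:

```python
import torch


def log_cuda_memory(tag: str = "") -> dict:
    """Snapshot CUDA allocator stats; returns an empty dict on CPU-only hosts."""
    if not torch.cuda.is_available():
        return {}
    stats = {
        "allocated_gb": torch.cuda.memory_allocated() / 2**30,
        "reserved_gb": torch.cuda.memory_reserved() / 2**30,
        "peak_gb": torch.cuda.max_memory_allocated() / 2**30,
    }
    print(f"[mem] {tag}: {stats}")
    return stats
```

Calling `torch.cuda.reset_peak_memory_stats()` between phases attributes the peak to a specific phase, and `torch.cuda.memory_summary()` gives a full allocator dump when a leak is suspected.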

  • Communication error troubleshooting: A lookup table for common NCCL and device mesh configuration errors, mapping specific exceptions to actionable remediation steps.
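
Such a lookup table can be kept as a simple mapping from error-message substrings to remediation hints. The entries below are illustrative examples, not an exhaustive AReaL reference:

```python
# Illustrative mapping: NCCL / mesh error substrings -> first things to check.
NCCL_ERROR_HINTS = {
    "NCCL timeout": "Look for a rank that skipped the collective; raise the "
                    "process-group timeout while debugging.",
    "unhandled system error": "Check NCCL_DEBUG=INFO logs for the failing "
                              "transport (IB/socket) and host.",
    "Duplicate GPU detected": "Two ranks mapped to the same device; verify "
                              "CUDA_VISIBLE_DEVICES and local-rank math.",
    "invalid usage": "A collective was called with mismatched dtype/shape, or "
                     "on the wrong process group.",
    "mesh dim": "The product of parallel dims must equal the world size; "
                "recheck the device-mesh configuration.",
}


def hint_for(error_message: str) -> str:
    """Return the first matching remediation hint, if any."""
    for pattern, hint in NCCL_ERROR_HINTS.items():
        if pattern.lower() in error_message.lower():
            return hint
    return "No hint recorded; capture NCCL_DEBUG=INFO logs from every rank."
```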

  • Always start by creating a minimal reproduction script rather than debugging within the full training loop to reduce variables.

  • Utilize rank-conditional printing and barrier synchronization to verify tensor shapes and device mesh membership across the training cluster.
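
Rank-conditional printing with barrier serialization keeps multi-rank output readable. `print_per_rank` below is a hypothetical helper that assumes the default process group is already initialized:

```python
import torch.distributed as dist


def print_per_rank(msg: str) -> None:
    """Print once per rank, in rank order, so output does not interleave."""
    rank, world = dist.get_rank(), dist.get_world_size()
    for r in range(world):
        if r == rank:
            print(f"[rank {rank}/{world}] {msg}", flush=True)
        dist.barrier()  # every rank waits, serializing the output


# Example: verify a tensor's shape on every rank before a collective:
#   print_per_rank(f"hidden states: {hidden.shape}, device={hidden.device}")
```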

  • For performance profiling, use py-spy to capture flame graphs, sampling long enough to collect representative data.
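
The invocation can be wrapped so the sampling duration is explicit. `pyspy_record_cmd` is a hypothetical helper, and 120 s is an assumed default (long enough to span several training steps), not an AReaL recommendation:

```python
def pyspy_record_cmd(pid: int, out: str = "flame.svg",
                     duration_s: int = 120) -> list[str]:
    """Build a py-spy record invocation for a live training rank."""
    return [
        "py-spy", "record",
        "--pid", str(pid),
        "--output", out,
        "--duration", str(duration_s),
        "--native",  # include native (C/CUDA launcher) frames
    ]
```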

  • Set environment variables such as TORCH_LOGS=+dynamo,recompiles when debugging issues related to torch.compile recompilation or graph execution.
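
TORCH_LOGS must reach the process before PyTorch configures its logging, so it belongs in the launching environment:

```python
import os

# Set in the launching environment (or before importing torch), not mid-run.
os.environ["TORCH_LOGS"] = "+dynamo,recompiles"

# Equivalent launcher form (illustrative torchrun usage):
#   TORCH_LOGS=+dynamo,recompiles torchrun --nproc_per_node=8 train.py
```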

  • Consult the provided key files reference to navigate the internal AReaL architecture, specifically looking into parallel_dims.py for mesh configuration issues.
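
As a quick sanity check on mesh configuration, the product of all parallel dimensions must equal the world size. A hypothetical standalone check in that spirit (AReaL's parallel_dims.py performs its own validation):

```python
def validate_parallel_dims(world_size: int, dims: dict[str, int]) -> None:
    """Raise if dp/tp/cp/ep sizes cannot tile the world size."""
    product = 1
    for name, size in dims.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
        product *= size
    if product != world_size:
        raise ValueError(
            f"product of parallel dims {dims} is {product}, "
            f"but world size is {world_size}"
        )
```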
