debug-distributed
Debugging guide for AReaL distributed training issues, including hangs, NCCL errors, OOM, and numerical inconsistencies under FSDP2/TP/CP/EP.
A specification-driven workflow management system for structured development lifecycle management, covering proposal, planning, implementation, and archival phases.
Executes a rigorous, multi-phase Fagan Inspection to systematically resolve persistent, stubborn bugs and complex code interactions.
Add evlog framework integration: automate wide-event logging across your stack with standardized middleware, build configurations, testing, and documentation.
Project bootstrap for Claude Code with safety guardrails, git workflow automation, project auditing, and structured multi-phase planning.
AI-optimized artifact tracking system for token-efficient project orchestration, phase management, and automated task delegation using YAML-Markdown hybrid formats.
Apply the Six Thinking Hats methodology to software testing for structured, comprehensive quality analysis, test strategy design, and team discussions.
Expert Swift code review for macOS/iOS. Detects memory leaks, threading bugs, concurrency issues, and accessibility gaps using parallel analysis agents.
Architectural guidance and pattern implementation for Java Spring Boot backends, covering REST API design, JPA, caching, async processing, and logging.
AI-powered Kubernetes and OpenShift troubleshooting. Proactively assess cluster health, debug pod failures, analyze logs, and validate security using Popeye-inspired patterns.
Tutorial for identifying and resolving CUDA runtime crashes using FlashInfer's API logging framework.
Native macOS/iOS app performance profiling via xctrace and CLI-based hotspot analysis without opening the Instruments UI.