nvidia-resiliency-ext
Provides resiliency, health monitoring, and fault tolerance utilities for NVIDIA GPU-accelerated distributed applications, including process management and API key handling.
Diagnose, isolate, and mitigate LLM context failures such as lost-in-the-middle, context poisoning, distraction, and context clash to improve agent reliability.
Access AI-ready datasets, benchmarks, and molecular oracles for drug discovery, including ADME, toxicity, DTI, and molecular generation tasks.
Automate your entire Git lifecycle from commit and PR creation to CI monitoring and branch merging, enforcing conventional commits throughout.
Port Semgrep rules to new languages using a strict, test-driven methodology. Includes applicability analysis, AST-based translation, and automated validation for each target language.
A framework to transform experimental ML prototypes into robust, production-ready Python packages using src layout, hybrid architecture, and strict configuration management.
A RAG-based AI solver for high school Chinese GSAT exams, featuring structured knowledge retrieval, reasoning templates, and explainable AI outputs.
Pushes local changes in git submodules (hive or core-geth) to their respective remote forks, ensuring repository synchronization.
Process and generate multimedia with Google Gemini. Analyze audio, images, videos, and PDFs using long context windows. Supports transcription, visual QA, OCR, and AI-driven image creation.
Architect and optimize production-grade RAG systems. Master embedding models, vector databases, chunking strategies, and retrieval pipelines for high-accuracy LLM applications.
Debugging guide for AReaL distributed training issues, including hangs, NCCL errors, OOM, and numerical consistency in FSDP2/TP/CP/EP.
Self-maintaining skill for OpenCode agents to update documentation, capture learnings, and extend tool/agent capabilities dynamically.