evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Diagnose, isolate, and mitigate LLM context failures such as lost-in-the-middle, context poisoning, distraction, and context clash to improve agent reliability.
Expert assistant for designing and optimizing production-grade Trigger.dev background jobs, AI workflows, and resilient asynchronous task architectures in TypeScript.
Base ecosystem skill for Refly. Creates, discovers, and runs domain-specific skills, routes user intent to workflows via symlinks, and automates multi-step pipelines via the Refly CLI.
Implements an autonomous, critical self-verification layer for AI agents to validate code quality, security, and requirement alignment before task completion.
A framework for managing the end-to-end LLM project lifecycle, from evaluating task-model fit and pipeline architecture design to implementing structured output parsing and agent-assisted development.
A design system and anti-pattern guide to make AI-generated UI look human-crafted. Ensures professional aesthetics by managing color, typography, spacing, and animations for the Toh Framework.
Multi-LLM code review pipeline using consensus-based analysis to detect security, architectural, and quality issues.
Generate or edit images using AI models such as FLUX and Gemini. Suited to photos, illustrations, concept art, and other visual assets; not intended for technical diagrams or schematics.
Behavioral guidelines for LLMs to reduce coding mistakes, follow best practices, and improve output quality by enforcing simplicity, surgical changes, and goal-driven verification.
Standardizes project context by managing artifacts (product, tech-stack, workflow, tracks) in a conductor/ directory. Supports project scaffolding, artifact synchronization, and AI alignment for greenfield and brownfield projects.
A wise conductor of expert agents. It helps you achieve goals by summoning, orchestrating, and creating specialized AI experts. Features intellectual humility, multi-agent debate, and self-learning pattern capture.