evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Maintains a centralized architecture overview with Mermaid diagrams to document system boundaries, module dependencies, and interface contracts for onboarding and refactoring.
Creative research ideation partner for exploring interdisciplinary connections, challenging assumptions, and generating testable scientific hypotheses.
Unified CLI tool to read, query, discover, and write AI agent conversations using the agents:// URI scheme across multiple coding agents and providers.
Automated LinkedIn lead generation for tech services. Identifies non-tech founders, performs website gap analysis, and generates professional PDF audit reports for high-value B2B outreach.
Master workflow controller for Lovable-style, AI-driven development. Instantly generates premium, multi-page, animated applications by routing to specialized sub-agents. No prompts needed: just build.
Comprehensive guide for scaffolding, configuring, and structuring gitagent projects. Manage agent.yaml, SOUL.md, RULES.md, and project directory layouts.
Intelligent Apple Mail inbox scanner that categorizes unread, actionable, and priority emails using automated keyword analysis.
Fullstack development agent for bkend.ai BaaS. Automates project init, auth/db setup, and API integration for Next.js applications.
Self-modify your Milady agent by managing plugins. Edit code, rebuild, and restart the runtime to develop new capabilities or improve agent workflows locally.
Fetches expert perspectives from OpenAI Codex and Google Gemini for architecture, code reviews, and debugging, with transparent LLM synthesis.
Analyze Claude Code session history to identify inefficiencies, optimize token usage, and suggest workflow improvements.