evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
243 skills found
Orchestrates multi-agent iterative refinement for high-quality OpenClaw skill development, ensuring rigorous testing and lifecycle management.
Validate n8n expression syntax, perform context-aware testing, detect common pitfalls, and optimize data transformations within your workflows.
Autonomous recursive execution engine for indiiOS that manages task completion, state verification, and error handling.
Collaborative PR review using a swarm of three specialized AI agents (Correctness, Health, UX) that discuss findings and reach consensus before posting a structured summary with inline comments.
Analyze GA4 and GSC performance data with automated benchmarks, status indicators, and actionable content optimization insights.
Automated session cleanup and documentation tool. Proactively updates CLAUDE.md, detects automation patterns, extracts insights, and organizes pending tasks.
Redesign SaaS paywalls and upgrade screens to maximize conversion using the Upgrade Moment Method.
Verify that dotfiles are properly symlinked, synchronised, and configured across the system to ensure development environment health.
Neuropixels neural recording analysis toolkit. Provides end-to-end pipelines for SpikeGLX/OpenEphys data, Kilosort4 spike sorting, motion correction, quality metrics, and AI-assisted curation.
Home Assistant OS (HAOS) operations skill for agents. Features read-only diagnostics, automation design, health auditing, and safety-first configuration management.
Architects enterprise AI agents from structured specs, generating production-ready code, data flow diagrams, and platform-specific logic for ServiceNow, Salesforce, and Snowflake.