evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Discover reusable agent skills, browse implementation details, and find the right skill for your workflow.
Manage a project's single source of truth (SSOT), memory, and cross-tool search. Acts as guardian of decisions.md and patterns.md for Claude Code. Use for context retention, memory synchronization, and decision tracking.
Automated toolkit for creating, maintaining, and enhancing CLAUDE.md files to ensure your project's AI-assisted development guidelines are always accurate, modular, and best-practice compliant.
Advanced prompt rewriting and optimization service. Analyzes prompts for clarity, specificity, and structure, providing actionable improvements, variations for testing, and prompt engineering best practices.
A systematic, multi-angle web research agent. Use for deep investigation, complex queries, and as a mandatory pre-research step before content generation to ensure evidence-backed, high-quality results.
A standardized workflow for converting raw PM notes, workshops, or rough drafts into polished, validated, and repository-compliant AI skills.
Perform a structured 8-factor conversion rate optimization (CRO) audit of any landing page to identify friction points and opportunities for growth.
Security-first vetting protocol for AI agent skills. Detects red flags like credential theft, obfuscated code, and unauthorized data exfiltration before installation.
Automated GitHub PR review agent for code quality, security analysis, and standards compliance using the gh CLI.
Safely execute, test, and verify commands discovered in documentation with real output capture, performance tracking, and git-aware safety protocols.
Autonomous recursive execution engine for indiiOS that manages task completion, state verification, and error handling.
Behavioral guidelines for LLMs to reduce coding mistakes, follow best practices, and improve output quality by enforcing simplicity, surgical changes, and goal-driven verification.