# eval-harness

Official evaluation framework for AI agent sessions, implementing Evaluation-Driven Development (EDD) principles to ensure reliability.

## Introduction
The Eval Harness enforces high-quality standards in AI-assisted software development through Evaluation-Driven Development (EDD). By treating evaluations as the unit tests of AI agents, it lets developers define success criteria, regression suites, and reliability metrics before implementing features, helping teams move beyond probabilistic generation toward predictable, reliable agentic workflows. The framework supports multiple grading methodologies, including deterministic code-based assertions, LLM-based model graders, and structured human review, so that every AI-generated contribution is validated against project requirements.
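As a rough illustration of how these grading methodologies can compose, here is a minimal sketch of an eval case with layered graders. The field names (`id`, `prompt`, `graders`) and the `grade` helper are hypothetical, not the harness's actual schema:

```javascript
// Hypothetical eval case combining the three grader types described above.
// Only the deterministic "code" graders are executed automatically here;
// "model" and "human" entries are placeholders for later stages.
const evalCase = {
  id: "parse-config-001",
  prompt: "Add a YAML config loader with schema validation.",
  graders: [
    { type: "code", run: (output) => /loadConfig\(/.test(output) }, // deterministic assertion
    { type: "model", rubric: "Does the diff handle malformed YAML gracefully?" },
    { type: "human", required: false }, // escalate for high-risk changes
  ],
};

// A grading pass succeeds only if every deterministic grader approves.
function grade(output, evalCase) {
  return evalCase.graders
    .filter((g) => g.type === "code")
    .every((g) => g.run(output));
}

console.log(grade("function loadConfig(path) {}", evalCase)); // true
```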
- Define capability-based tests to confirm the agent can execute new, complex logic tasks.
- Implement regression suites to prevent code drift and ensure that previously solved problems remain stable.
- Utilize pass@k and pass^k metrics to statistically measure the reliability and success rate of agent responses.
- Integrate seamlessly into the development lifecycle with pre-coding definition phases and post-coding reporting.
- Manage evaluation artifacts within the `.claude/evals/` directory for version control and persistent audit logs.
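The two reliability metrics above can be sketched as follows, assuming the standard definitions: pass@k estimates the probability that at least one of k sampled attempts succeeds (using the unbiased estimator from n samples with c successes), while pass^k is the probability that all k attempts succeed. The function names are illustrative:

```javascript
// pass@k: unbiased estimate of P(at least one of k samples passes),
// given n total samples of which c passed.
// pass@k = 1 - C(n-c, k) / C(n, k), computed as a stable product.
function passAtK(n, c, k) {
  if (n - c < k) return 1.0; // every size-k draw must contain a success
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= 1 - k / i;
  }
  return 1 - prod;
}

// pass^k: P(all k independent attempts pass), given per-attempt rate p.
function passPowK(p, k) {
  return Math.pow(p, k);
}

console.log(passAtK(10, 3, 1)); // ≈ 0.3 (matches the raw success rate c/n for k=1)
console.log(passPowK(0.9, 3)); // ≈ 0.729
```

Note the asymmetry: pass@k rewards getting it right at least once (a capability measure), while pass^k penalizes any failure across k tries (a reliability measure).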
- Always define your evaluation criteria in a markdown document before writing any code to ensure clear success boundaries.
- Use deterministic code-based graders for build, test, and regex pattern checks whenever possible to avoid unnecessary LLM overhead.
- Apply model-based graders for qualitative tasks such as checking code structure, edge-case coverage, and appropriate error handling.
- Maintain a history of runs to track reliability trends over time; failing to monitor pass@k metrics can lead to undetected degradation in model performance.
- Never rely solely on automated checks for security-critical modules; always include an explicit manual review step when assessing high-risk changes.
## Repository Stats

- Stars: 169,888
- Forks: 26,327
- Open Issues: 185
- Language: JavaScript
- Default Branch: main