eval
Evaluate Deca agent prompts and behavioral consistency through automated test runners, manual LLM judgment, and structured reporting.
Introduction
The Eval skill provides a testing framework for checking that the Deca AI agent adheres to its foundational system prompts, specifically IDENTITY.md and SOUL.md. It is intended for developers and system maintainers who need to verify agent persona, safety protocols, and operational constraints through a repeatable behavioral evaluation cycle. Automated test execution is kept separate from the manual, human-in-the-loop judgment step, so qualitative agent behavior is scored against explicit criteria.
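The separation between the automated run and the manual judgment step can be pictured as two records joined by a case id. The following is a minimal sketch, not the skill's actual schema: the type names, the `caseId`, `response`, `score`, and `reasoning` fields, and the pass threshold of 70 are all assumptions made for illustration. Only `gitCommit` and `timestamp` are field names taken from this description.

```typescript
// Hypothetical shapes for illustration; the real eval/ schema may differ.
type Category = "identity" | "soul" | "agent";

// Produced by the automated runner (phase 1).
interface RunResult {
  caseId: string;
  category: Category;
  response: string; // raw agent output captured by the runner
  gitCommit: string; // metadata that must survive the judgment phase
  timestamp: string;
}

// Produced by the manual judgment step (phase 2).
interface Judgment {
  caseId: string;
  score: number; // 0-100, per the scoring guide
  reasoning: string;
}

interface JudgedResult extends RunResult {
  score: number;
  reasoning: string;
  passed: boolean;
}

// Assumed cutoff; the source does not state a pass threshold.
const PASS_THRESHOLD = 70;

function applyJudgment(
  run: RunResult,
  judgment: Judgment,
  threshold: number = PASS_THRESHOLD,
): JudgedResult {
  if (run.caseId !== judgment.caseId) {
    throw new Error(`case mismatch: ${run.caseId} vs ${judgment.caseId}`);
  }
  if (!Number.isFinite(judgment.score) || judgment.score < 0 || judgment.score > 100) {
    throw new RangeError(`score must be within 0-100, got ${judgment.score}`);
  }
  // Spreading `run` first carries gitCommit and timestamp through unchanged.
  return {
    ...run,
    score: judgment.score,
    reasoning: judgment.reasoning,
    passed: judgment.score >= threshold,
  };
}
```

Because the judged record is built by copying the run record, the judgment step can only add fields; it cannot drop the run metadata.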
- Executes test suites against the running Deca Gateway on port 7014 using a dedicated runner.
- Supports modular test categories, including Identity, Soul, and Agent-specific behavioral rules.
- Facilitates manual LLM judgment of agent outputs with specific scoring guides and objective evaluation criteria.
- Generates standardized Markdown reports summarizing performance metrics, pass/fail status, and qualitative reasoning.
- Lets you add new test cases to the eval/cases/ directory without altering core agent logic.
- Targets behavioral verification: adherence to personality traits, safety warnings, and task execution rules.
- Always start a fresh Gateway session before running evaluations to prevent context pollution from previous interactions.
- Use the provided scoring guide (0-100) to keep evaluation standards consistent across different model versions.
- Ensure all metadata fields, such as gitCommit and timestamp, are preserved during the manual judgment phase.
- Use the quickCheck mechanism for objective verification of string matches and keyword triggers in agent responses.
- Requires local development environment access with Bun, as the workflow relies on runtime scripts within the eval/ directory.
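The quickCheck and Markdown report items above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the skill's implementation: the source names quickCheck but does not show its schema, so `QuickCheck`, `mustInclude`, `mustNotInclude`, `pattern`, `runQuickCheck`, `ReportRow`, and `renderReport` are hypothetical names, and the report layout is one plausible choice.

```typescript
// Hypothetical quick-check shape; the real quickCheck schema is not shown in the source.
interface QuickCheck {
  mustInclude?: string[]; // keywords expected in the response (case-insensitive)
  mustNotInclude?: string[]; // keywords that must not appear (case-insensitive)
  pattern?: RegExp; // optional regex; avoid the `g` flag, which makes test() stateful
}

interface QuickCheckResult {
  passed: boolean;
  failures: string[];
}

function runQuickCheck(response: string, check: QuickCheck): QuickCheckResult {
  const text = response.toLowerCase();
  const failures: string[] = [];
  for (const kw of check.mustInclude ?? []) {
    if (!text.includes(kw.toLowerCase())) failures.push(`missing keyword: ${kw}`);
  }
  for (const kw of check.mustNotInclude ?? []) {
    if (text.includes(kw.toLowerCase())) failures.push(`forbidden keyword: ${kw}`);
  }
  if (check.pattern && !check.pattern.test(response)) {
    failures.push(`pattern not matched: ${check.pattern}`);
  }
  return { passed: failures.length === 0, failures };
}

// Hypothetical row shape for one judged case.
interface ReportRow {
  caseId: string;
  score: number; // 0-100
  passed: boolean;
  reasoning: string;
}

function renderReport(
  rows: ReportRow[],
  meta: { gitCommit: string; timestamp: string },
): string {
  const passedCount = rows.filter((r) => r.passed).length;
  const average =
    rows.length === 0 ? 0 : rows.reduce((sum, r) => sum + r.score, 0) / rows.length;
  const lines = [
    "# Eval Report",
    "",
    `- Commit: ${meta.gitCommit}`,
    `- Timestamp: ${meta.timestamp}`,
    `- Passed: ${passedCount}/${rows.length}`,
    `- Average score: ${average.toFixed(1)}`,
    "",
    "| Case | Score | Status | Reasoning |",
    "| --- | --- | --- | --- |",
    // Escape pipes so free-text reasoning cannot break the table.
    ...rows.map(
      (r) =>
        `| ${r.caseId} | ${r.score} | ${r.passed ? "PASS" : "FAIL"} | ${r.reasoning.replace(/\|/g, "\\|")} |`,
    ),
  ];
  return lines.join("\n");
}
```

Keeping the quick check purely string-based means it can run before any manual scoring and flag obvious misses without a judge's involvement.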
Repository Stats
- Stars: 1
- Forks: 0
- Open Issues: 0
- Language: TypeScript
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 11:02 PM