evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Evaluate Deca agent prompts and behavioral consistency through automated test runners, manual LLM judgment, and structured reporting.
Advanced prompt rewriting and optimization service. Analyzes prompts for clarity, specificity, and structure, providing actionable improvements, variations for testing, and prompt engineering best practices.
Official evaluation framework for AI agent sessions, implementing Evaluation-Driven Development (EDD) principles to ensure reliability.
Structured manuscript and grant review assistant utilizing checklist-based evaluation for methodology, statistical validity, and compliance with reporting standards like CONSORT and STROBE.
Systematically evaluate scholarly work using the ScholarEval framework, providing structured, quantitative, and qualitative assessment across research quality dimensions with actionable feedback.
A systematic workflow to instrument, evaluate, and monitor LLM applications using TruLens, supporting frameworks like LangChain, LangGraph, and LlamaIndex.
Comprehensive AI-generated text detection framework. Features multi-layer analysis of vocabulary, structural patterns, model-specific fingerprints, and technical metadata artifacts to identify AI authorship.
Optimize agent performance and token usage through advanced context compression, structured summarization, and task-oriented state management for long-running sessions.
Evaluate code generation models using BigCode Evaluation Harness. Benchmarks include HumanEval, MBPP, and MultiPL-E with pass@k metrics for multi-language coding models.
Process and generate multimedia with Google Gemini. Analyze audio, images, videos, and PDFs using large context windows. Supports transcription, visual QA, OCR, and AI-driven image creation.
Prevent AI hallucination and ensure evidence-based, verifiable outputs when analyzing code, reviewing technical documents, or providing recommendations.