evaluation
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Discover reusable agent skills, browse implementation details, and find the right skill for your workflow.
137 skills found
Generate finite-difference stencils, select optimal numerical schemes for PDEs/ODEs, and perform truncation error analysis to improve simulation accuracy.
Query the Pollinations text API with web-search-enabled models like Gemini and Perplexity for grounded, real-time research.
A professional bug bounty reporting agent that enforces impact-first writing, CVSS 3.1 scoring, and pre-submit validation for platforms like HackerOne, Bugcrowd, and Intigriti.
Generate high-quality text-to-speech audio using Microsoft Edge's neural TTS service. Supports multiple languages, voices, and adjustable audio parameters.
Guides agent memory system implementation, compares frameworks (Mem0, Zep, Letta, LangMem, Cognee), and designs persistence architectures for cross-session knowledge retention.
Evidence-first literature collector for automated research pipelines. Scales paper pools to 1200+ with metadata normalization, provenance tracking, and multi-source ingestion.
Standardize frontend communication by documenting data requirements and business rules for backend developers, ensuring clear alignment without dictating implementation details.
AI-powered video editing agent for talking head videos, featuring speech-to-text, disfluency detection, and browser-based review workflows.
Normalizes testing defect logs by correcting typos, abbreviations, and ambiguous descriptions based on product-specific codebooks and station validation.
Tutorial for identifying and resolving CUDA runtime crashes using FlashInfer's API logging framework.
Verify the novelty of a research idea against recent literature. Use when the user says '查新' or 'novelty check', or needs to confirm whether a method is original.