evaluating-code-models
Evaluate code generation models using the BigCode Evaluation Harness. Benchmarks include HumanEval, MBPP, and MultiPL-E, with pass@k metrics for multi-language coding models.
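For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), as introduced alongside HumanEval. It is an illustration only, not the harness's own code; the function name `pass_at_k` and the parameter names `n` (samples generated per problem), `c` (samples that pass the tests), and `k` are assumptions chosen for this example.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem (hypothetical name)
    c: number of those samples that passed the unit tests
    k: sample budget being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score would typically be the mean of this value over all problems, which `mean_pass_at_k` sketches for a list of per-problem `(n, c)` pairs.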
Multi-perspective AI consultation for technical architecture, complex refactoring, and structured debugging.
Gate 2 development cycle skill that validates observability implementation, including structured logging, OpenTelemetry tracing, and instrumentation coverage, without modifying code.
Implements an autonomous, critical self-verification layer for AI agents to validate code quality, security, and requirement alignment before task completion.
Convert markdown PRDs into structured prd.json files for the Ralph autonomous AI agent system to enable repeatable, context-aware software development.
Review Hyperlane documentation changes against project standards, ensuring compliance with architectural patterns and content guidelines.
Expert automated code review for Go CLI applications, focusing on Cobra/urfave patterns, security, performance, idiomatic Go, and robust error handling.
Analyze project structures, dependencies, and patterns using parallel agent execution to generate comprehensive context documentation for rapid codebase onboarding and AI-assisted development.
Execute implementation plans in separate sessions with review checkpoints, ensuring task-by-task verification and robust code quality.
Automated code review for STYLY-NetSync, enforcing protocol parity, thread safety, and Unity C#/Python conventions.
Implements UI components from Figma/mockups with pixel-perfect accuracy, intelligent design validation, and adaptive agent switching.
Token-efficient codebase analysis skill for call graphs, semantic search, impact analysis, and data flow. Saves ~95% of tokens compared with raw file reads.