evaluating-code-models
Evaluate code generation models using the BigCode Evaluation Harness. Supported benchmarks include HumanEval, MBPP, and MultiPL-E, scored with pass@k metrics, for multi-language coding models.
The final execution agent for the vibe-coding workflow. Builds your MVP incrementally by following the AGENTS.md master plan, managing session continuity, and verifying each feature via testing.
Comprehensive biosignal processing toolkit for ECG, EEG, EDA, RSP, PPG, EMG, and EOG signal analysis, enabling psychophysiology research and multi-modal integration.
Search, analyze, and audit GeminiClaw session logs and memory. Use it to investigate past interactions, track token usage, debug tool calls, and monitor agent performance.
Mandatory execution-based validation for all software implementation tasks. Ensures code works through empirical verification before confirmation.
Professional-grade spreadsheet automation for Claude: create, edit, analyze, and visualize Excel and CSV files with rigorous formula integrity and financial formatting standards.
Systematically improve marketing copy through a 7-pass editing framework to boost clarity, tone, and conversion impact.
Production-grade FFmpeg automation for video and audio processing, including trimming, concatenation, format conversion, codec optimization, and filter application.
AI-driven GitHub Actions automation featuring swarm-based workflow orchestration, intelligent CI/CD pipeline management, and autonomous repository maintenance.
A structured decision-making tool that applies RICE, MoSCoW, Kano, and value-effort frameworks to prioritize software features, roadmap items, and build-vs-defer decisions with data-driven objectivity.
Automates GitHub release creation by generating formatted changelogs from conventional commits and managing version bumps.
Correlate content attributes with GA4 and GSC metrics to identify performance drivers and optimization opportunities.