evaluating-code-models
Evaluate code generation models using BigCode Evaluation Harness. Benchmarks include HumanEval, MBPP, and MultiPL-E with pass@k metrics for multi-language coding models.
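For readers unfamiliar with the pass@k metric named above, here is a minimal sketch of the unbiased estimator commonly used with HumanEval-style benchmarks: given n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen for illustration; this listing does not confirm how the harness itself computes the metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: number of attempts allowed
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    per_problem = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(per_problem, k=1))
```

Computing the ratio with exact integer binomials avoids overflow concerns for the sample counts typically used; for very large n, a running-product formulation may be preferable.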
Discover reusable agent skills, browse implementation details, and find the right skill for your workflow.
Operate Google Tag Manager via MCP. Handles OAuth, resource discovery, and CRUD operations for tags, triggers, and variables directly from your LLM agent.
Integrate Snowflake with MCP clients. Manage Snowflake endpoints, validate connectivity, and leverage Cortex AI (Search, Analyst, Agent) services directly within your AI workflow.
Token-efficient virtual task management for AI-assisted development. Manage task lifecycles, dependencies, and TDD workflows with surgical context injection.
Crawl websites to extract content as clean markdown files. Ideal for documentation, research, and offline knowledge management.
Generates a random lucky number between 0 and 9999 for games, decision-making, or entertainment.
Unified AI gateway for 100+ LLMs with OpenAI-compatible API, model fallbacks, load balancing, and enterprise-grade tools.
A RAG-based AI solver for high school Chinese GSAT exams, featuring structured knowledge retrieval, reasoning templates, and explainable AI outputs.
Audit and synchronize the supported LLM model list in assets.py against the authoritative litellm registry.
Token-efficient codebase analysis skill for call graphs, semantic search, impact analysis, and data flow. Saves ~95% tokens vs. raw reads.
A testing fixture for validating AI agent skill configurations and detecting rule violations.
Package entire code repositories into single, AI-optimized files. Ideal for providing codebase context to LLMs like Claude, ChatGPT, and Gemini for analysis, security audits, and bug investigations.