evaluating-code-models
Evaluate code generation models using the BigCode Evaluation Harness. Benchmarks include HumanEval, MBPP, and MultiPL-E, with pass@k metrics for multi-language coding models.
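As a rough illustration of the pass@k metric named above, the sketch below computes the unbiased estimator commonly used with HumanEval-style benchmarks: given n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its argument names are illustrative assumptions, not part of the BigCode Evaluation Harness API.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper, not a harness API).

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: sample budget being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Example: 10 samples, 3 correct, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A benchmark-level score would typically be the mean of this per-problem value across all problems in the suite.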
Provides targeted, concise English language editing and stylistic improvements for text without performing full rewrites.
Epsimo AI platform SDK and CLI for building agents with persistent state, Virtual Database, streaming conversations, and a React UI kit.
Create and test AI-ready MCP tools for any web application. Inject code, automate browser interactions, and turn websites into intelligent agents.
Proactive context window management for AI agents via intelligent token monitoring, snapshot creation, and selective state rehydration to maintain continuity during long sessions.
Automated security auditing for project dependencies. Scans package files (npm, pip, maven, etc.) for vulnerabilities, CVEs, and license issues, offering automated fix suggestions and integration for secure deployment workflows.
Build and manage MCP servers using the FastMCP framework. Guide for creating tools, resources, prompts, Claude Desktop integration, and deployment with Python and TypeScript.
Automates the creation and maintenance of CLAUDE.md files. It monitors codebase evolution and keeps project memory in sync with file changes, structure, and build commands.
Implement a full Model Context Protocol (MCP) stack in Rails. Connect to external servers, expose your Rails app as an MCP server, or manage subprocess MCP containers via Docker with OAuth 2.1 PKCE support.
Comprehensive guide to scaffolding, configuring, and structuring gitagent projects. Manage agent.yaml, SOUL.md, RULES.md, and project directory layouts.
Expert Microsoft 365 tenant administration skill for setup, user lifecycle, security policy configuration, compliance, and automated PowerShell scripting for Global Administrators.
Stress-test existing product feature ideas by identifying risky assumptions across Value, Usability, Viability, and Feasibility using a multi-perspective devil's advocate framework.