evaluating-code-models
Evaluate code generation models using the BigCode Evaluation Harness. Benchmarks include HumanEval, MBPP, and MultiPL-E, with pass@k metrics for multi-language coding models.
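For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the unit tests. It is an illustration under stated assumptions, not the harness's own code: the function names `pass_at_k` and `mean_pass_at_k` and the input layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    Hypothetical helper; computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0:
    # any draw of k samples must then contain a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs (assumed layout)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Three made-up problems with 10 samples each and 0, 3, 10 passing samples.
    toy = [(10, 0), (10, 3), (10, 10)]
    print(f"pass@1 = {mean_pass_at_k(toy, 1):.3f}")
    print(f"pass@5 = {mean_pass_at_k(toy, 5):.3f}")
```

With k = 1 the estimate reduces to the plain fraction of passing samples, c / n, which is a quick sanity check on any implementation.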
Build systematic evaluation frameworks for AI agents using multi-dimensional rubrics, LLM-as-a-judge, and regression testing to measure performance, quality, and context engineering effectiveness.
Evaluate Deca agent prompts and behavioral consistency through automated test runners, manual LLM judgment, and structured reporting.
A systematic workflow to instrument, evaluate, and monitor LLM applications using TruLens, supporting frameworks like LangChain, LangGraph, and LlamaIndex.
Bayesian modeling and probabilistic programming with PyMC. Build hierarchical models, perform MCMC sampling (NUTS), variational inference, and conduct rigorous model comparison using LOO and WAIC.
Statistical modeling and econometrics library for Python. Performs OLS, GLM, mixed models, ARIMA, diagnostics, and inference for rigorous scientific analysis.
Classical machine learning with scikit-learn. Use for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building robust ML pipelines in Python.
Generate or edit images using AI models like FLUX and Gemini. Ideal for photos, illustrations, concept art, and visual assets, excluding technical diagrams and schematics.
Systematically evaluate scholarly work using the ScholarEval framework, providing structured, quantitative, and qualitative assessment across research quality dimensions with actionable feedback.
A comprehensive financial modeling suite for investment analysis, featuring DCF valuation, sensitivity testing, Monte Carlo simulations, and scenario planning.
Official evaluation framework for AI agent sessions, implementing Evaluation-Driven Development (EDD) principles to ensure reliability.
A suite of professional tools for auditing, evaluating, chunking, and scaffolding production-ready RAG pipelines within Claude Code.