kreuzberg

Introduction

Kreuzberg is a high-performance, polyglot document intelligence library built with a Rust core, designed to handle large-scale data extraction across 91+ file formats, including PDF, Office documents, archives, images, HTML, and email. It is intended for software engineers, data scientists, and automation specialists who need to integrate document parsing directly into their applications. Whether you are building RAG pipelines, automating document intake, or performing structural analysis on legacy academic documents, Kreuzberg provides the primitives for reliable, high-fidelity data extraction without requiring specialized hardware like GPUs.

The library excels at handling complex document structures, leveraging tree-sitter for code intelligence across 248 programming languages, and offering native bindings for major languages including Python, Node.js, Rust, Java, C#, Go, Ruby, Elixir, R, and C. By using its high-speed Rust backend, it provides consistent parsing results across different environments. It also includes advanced LLM integration, enabling structured JSON output and token-efficient serialization formats like TOON, which significantly reduce context window consumption in RAG and LLM workflows.

Multi-engine OCR support including Tesseract, PaddleOCR, EasyOCR, and integration with 146 vision model providers for VLM-based optical character recognition.
Code intelligence extraction including functions, classes, and symbols for 248 programming languages using tree-sitter.
Memory-efficient streaming parsing that handles multi-gigabyte documents without loading them fully into memory.
Comprehensive document intelligence features covering table extraction, metadata parsing, and semantic chunking.
Flexible deployment options including library-native bindings, a production-grade CLI tool, a REST API server, and an MCP (Model Context Protocol) server.
GFM-quality output conversion supporting Markdown, HTML, Djot, and plain text with proper cross-format handling.
Use the CLI for batch processing of local document stores or integrate directly via language-specific SDKs for real-time extraction.
Configure complex tasks via TOML files or dynamic config objects, covering password-protected PDFs, custom OCR backends, and post-processing plugins.
Inputs include local files or binary streams across 91+ file types; outputs include extracted text, structural JSON/TOON data, or LLM-friendly markdown strings.
Install ONNX Runtime before using features that depend on PaddleOCR. In Rust, enable only the feature flags you need (e.g., tokio-runtime, pdf) to reduce binary size and runtime dependencies.

Repository Stats