parxy
A unified document processing gateway for PDF parsing, text extraction, conversion, and document manipulation across multiple local and cloud providers.
Introduction
Parxy is a high-performance document processing gateway that provides a unified interface for complex document workflows. It abstracts the differences between parsing backends, so developers and automated agents can switch between local libraries such as PyMuPDF and Unstructured and cloud-based services such as LlamaParse, LLMWhisperer, and PdfAct without changing application logic. The core value of Parxy is its consistent hierarchical data model, which represents each document as pages, text blocks, lines, spans, and individual characters, with precise bounding box coordinates and semantic role information. This makes it well suited to AI-driven data extraction, RAG pipelines, and systematic document conversion tasks.
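To make the hierarchy described above concrete, here is a minimal sketch of a page, block, line, span, and character model with bounding boxes and roles. It uses standard-library dataclasses as a stand-in for validated models; every class, field, and helper name (`BBox`, `TextBlock`, `texts_with_role`, `make_line`) and the role labels are assumptions for illustration, not Parxy's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in page coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Character:
    text: str
    bbox: BBox


@dataclass
class Span:
    characters: list[Character] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.characters)


@dataclass
class Line:
    spans: list[Span] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(s.text for s in self.spans)


@dataclass
class TextBlock:
    role: str  # semantic role, e.g. "heading" or "paragraph" (assumed labels)
    lines: list[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    number: int
    blocks: list[TextBlock] = field(default_factory=list)


@dataclass
class Document:
    pages: list[Page] = field(default_factory=list)

    def texts_with_role(self, role: str) -> list[str]:
        """Collect the text of every block tagged with the given role."""
        return [b.text for p in self.pages for b in p.blocks if b.role == role]


def make_line(text: str, x: float, y: float,
              char_w: float = 6.0, char_h: float = 10.0) -> Line:
    """Build a single-span line, giving each character a fixed-width box."""
    chars = [
        Character(ch, BBox(x + i * char_w, y, x + (i + 1) * char_w, y + char_h))
        for i, ch in enumerate(text)
    ]
    return Line(spans=[Span(characters=chars)])


# Hand-built example document: one page with a heading and a paragraph.
doc = Document(pages=[Page(number=1, blocks=[
    TextBlock(role="heading", lines=[make_line("Title", 72.0, 72.0)]),
    TextBlock(role="paragraph", lines=[make_line("Body text", 72.0, 100.0)]),
])])
headings = doc.texts_with_role("heading")
```

Because every level keeps its children, a consumer can stay at the block level for structure (headings versus paragraphs) or descend to individual characters when exact positions matter.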
- Unified API surface to swap between parsing engines like PyMuPDF, PdfAct, LlamaParse, LLMWhisperer, and Unstructured.
- Hierarchical document model providing structural insights (paragraphs, headings) and spatial data (bbox coordinates).
- Advanced PDF manipulation tools, including merging documents with page-range selection, splitting files into individual pages, and optimizing large PDFs (scrubbing metadata, subsetting fonts, compressing images).
- Integrated batch processing capabilities for high-volume document ingest with parallel execution and streaming result handlers.
- Robust command-line interface (CLI) for rapid prototyping, featuring a TUI for parser comparison, interactive document previewing, and direct markdown conversion.
- Extensible architecture that allows developers to integrate custom parsers or handle specific PDF attachment extraction requirements.
- Best suited for developers building data ingestion pipelines, research automation, or document management agents.
- Requires Python 3.12+ and utilizes Pydantic v2 for data validation and schema safety.
- Installation options include base packages or extended extras (e.g., [all], [llama], [unstructured_local]) to manage dependency footprint.
- Expected inputs are predominantly PDF files, with support for converting complex layouts into structured JSON or Markdown formats.
- Configuration is simplified through standard environment variables for API keys and an optional .env file setup for cloud service credentials.
Repository Stats
- Stars: 9
- Forks: 1
- Open Issues: 3
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 04:07 PM