parxy

A unified document processing gateway for PDF parsing, text extraction, conversion, and document manipulation across multiple local and cloud providers.

Introduction

Parxy is a document processing gateway that provides a single interface for complex document workflows. It abstracts the differences between parsing backends, so developers and automated agents can switch between local libraries such as PyMuPDF and Unstructured and cloud services such as LlamaParse, LLMWhisperer, and PdfAct without changing application logic. Its core value is a consistent hierarchical data model: every document is parsed into pages, text blocks, lines, spans, and individual characters, carrying bounding box coordinates and semantic role information. This makes it well suited to AI-driven data extraction, RAG pipelines, and systematic document conversion.
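To make the hierarchy concrete, here is a minimal sketch of such a model using plain dataclasses. Every class and field name in it (BBox, Span, Line, TextBlock, Page, Document, role) is an illustrative assumption, not Parxy's actual schema; the real models are built on Pydantic v2 and may be shaped differently. The character level is omitted to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class BBox:
    # Hypothetical bounding box: (x0, y0) is one corner, (x1, y1) the opposite one.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Span:
    # A run of text with its position on the page.
    text: str
    bbox: BBox


@dataclass
class Line:
    spans: List[Span] = field(default_factory=list)

    @property
    def text(self) -> str:
        # A line's text is the concatenation of its spans.
        return "".join(span.text for span in self.spans)


@dataclass
class TextBlock:
    lines: List[Line] = field(default_factory=list)
    role: Optional[str] = None  # assumed labels such as "heading" or "paragraph"

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    number: int
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def blocks_with_role(self, role: str) -> Iterator[TextBlock]:
        # Walk the hierarchy top-down and yield blocks that carry the given role.
        for page in self.pages:
            for block in page.blocks:
                if block.role == role:
                    yield block


def build_sample() -> Document:
    # Hand-built one-page document, standing in for real parser output.
    heading = TextBlock(
        lines=[Line(spans=[Span("Intro", BBox(72, 72, 140, 90))])],
        role="heading",
    )
    body = TextBlock(
        lines=[
            Line(
                spans=[
                    Span("Hello, ", BBox(72, 100, 110, 112)),
                    Span("world.", BBox(110, 100, 150, 112)),
                ]
            )
        ],
        role="paragraph",
    )
    return Document(pages=[Page(number=1, blocks=[heading, body])])
```

Because the shape is the same whichever backend produced it, downstream code such as blocks_with_role can filter by semantic role or read coordinates without knowing which parser ran.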

  • Unified API surface to swap between parsing engines like PyMuPDF, PdfAct, LlamaParse, LLMWhisperer, and Unstructured.

  • Hierarchical document model providing structural insights (paragraphs, headings) and spatial data (bbox coordinates).

  • Advanced PDF manipulation tools, including merging documents with page-range selection, splitting files into individual pages, and optimizing large PDFs (scrubbing metadata, subsetting fonts, compressing images).

  • Integrated batch processing for high-volume document ingestion, with parallel execution and streaming result handlers.

  • Robust command-line interface (CLI) for rapid prototyping, featuring a TUI for parser comparison, interactive document previewing, and direct markdown conversion.

  • Extensible architecture that allows developers to integrate custom parsers or handle specific PDF attachment extraction requirements.

  • Best suited for developers building data ingestion pipelines, research automation, or document management agents.

  • Requires Python 3.12+ and uses Pydantic v2 for data validation and schema safety.

  • Installation options include base packages or extended extras (e.g., [all], [llama], [unstructured_local]) to manage dependency footprint.

  • Expected inputs are predominantly PDF files, with support for converting complex layouts into structured JSON or Markdown formats.

  • Configuration is simplified through standard environment variables for API keys and an optional .env file setup for cloud service credentials.
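The two ideas at the heart of the list above, a single entry point with swappable backends and parallel batch processing with streamed results, can be sketched with the standard library alone. This is not Parxy's API: the names parse, parse_batch, DRIVERS, and the two fake drivers are hypothetical, and the drivers return placeholder strings instead of reading any file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Iterator, Tuple

# In this sketch a "driver" is any callable that maps a file path to extracted text.
Driver = Callable[[str], str]


def fake_local_driver(path: str) -> str:
    # Hypothetical stand-in for a local library backend.
    return f"local:{path}"


def fake_cloud_driver(path: str) -> str:
    # Hypothetical stand-in for a cloud service backend.
    return f"cloud:{path}"


# Registry of available backends, keyed by name.
DRIVERS: Dict[str, Driver] = {
    "local": fake_local_driver,
    "cloud": fake_cloud_driver,
}


def parse(path: str, driver: str = "local") -> str:
    # Single entry point: callers choose a backend by name only.
    try:
        backend = DRIVERS[driver]
    except KeyError:
        raise ValueError(f"unknown driver: {driver!r}") from None
    return backend(path)


def parse_batch(
    paths: Iterable[str], driver: str = "local", workers: int = 4
) -> Iterator[Tuple[str, str]]:
    # Submit every file to a thread pool and yield (path, result) pairs
    # as each one finishes, so results stream in completion order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(parse, p, driver): p for p in paths}
        for future in as_completed(futures):
            yield futures[future], future.result()


if __name__ == "__main__":
    for path, text in parse_batch(["a.pdf", "b.pdf"], driver="cloud"):
        print(path, text)
```

Switching from the local to the cloud backend changes only the driver argument, which is the property the unified interface is meant to provide. Results arrive in completion order rather than input order, so a caller that needs the original order should key on the returned path.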

Repository Stats

Stars: 9
Forks: 1
Open Issues: 3
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: May 3, 2026, 04:07 PM