math-extractor
Extracts mathematical content like definitions, theorems, and proofs from documents (PDF, MD, TEX, TXT) using AI-based cleaning and conversion.
Introduction
The Math Extractor is a specialized agent skill designed for researchers, students, and academics who need to isolate formal mathematical structures from complex documents. By automating the extraction of Definitions, Theorems, Lemmas, Propositions, and Proofs, this tool streamlines the process of building mathematical datasets, lecture notes, or reference libraries. It handles a variety of file formats, including PDF, Markdown, LaTeX, and plain text, ensuring that mathematical notation and logic are preserved during the extraction process.
- Advanced PDF processing powered by MinerU for high-fidelity conversion to structured Markdown.
- Intelligent content chunking that keeps paragraphs and mathematical formulas intact rather than splitting them across chunks.
- AI-driven filtering that aggressively removes non-essential document elements, such as images, tables of contents, and long reference lists, to reduce token usage.
- Built-in handling of mathematical inequalities and symbols: essential HTML/Markdown tags are whitelisted so that cleaning does not strip critical content.
- Automated encoding detection, including UTF-8, GBK, and Latin-1, for compatibility with documents from international sources.
- The tool requires an extraction API key for LLM operations (e.g., GPT-4o); MinerU credentials are optional and used for PDF parsing.
- Credentials are read from environment variables such as EXTRACTION_API_KEY and MINERU_API_KEY, so these must be set before running.
- Output is written as a clean, structured `_extracted.md` file in the designated output directory.
- Ideal for batch processing large sets of technical papers or textbooks where manual extraction is impractical.
- The script includes retry logic for API calls, which makes parallel batch processing more robust against network instability.
- Best suited to formal mathematical writing where the structure is consistent and clear.
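The chunking behaviour described in the list above can be illustrated with a minimal sketch. It is not the skill's actual implementation: the function names `split_blocks` and `chunk_text`, the `max_chars` limit, and the assumption that display math is delimited by `$$` are all hypothetical choices made for illustration.

```python
def split_blocks(text: str) -> list[str]:
    """Split text into paragraphs, keeping each $$...$$ display block whole."""
    blocks, current, in_math = [], [], False
    for line in text.splitlines():
        # An odd number of "$$" markers on a line opens or closes display math.
        if line.count("$$") % 2 == 1:
            in_math = not in_math
        if line.strip() == "" and not in_math:
            # A blank line outside math ends the current paragraph.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole blocks into chunks (hypothetical size limit).

    A single block longer than max_chars is kept intact in its own chunk.
    """
    chunks, current, size = [], [], 0
    for block in split_blocks(text):
        extra = len(block) + (2 if current else 0)  # 2 for the "\n\n" joiner
        if current and size + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            extra = len(block)
        current.append(block)
        size += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because splitting happens only at blank lines outside display math, a formula that contains a blank line still ends up in a single chunk.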
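Encoding detection over UTF-8, GBK, and Latin-1 could work roughly as follows. This is a sketch under the assumption that the encodings are simply tried in that order; the helper name `decode_bytes` is hypothetical and the skill may use a different strategy.

```python
def decode_bytes(data: bytes, encodings=("utf-8", "gbk", "latin-1")) -> tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Order matters: Latin-1 accepts any byte sequence, so it must come last
    and acts as a fallback.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reachable if the caller passes a list without a catch-all encoding.
    raise ValueError("none of the candidate encodings could decode the data")
```

For a file on disk, the bytes can be read with `pathlib.Path(path).read_bytes()` and passed to the helper. Note that a fallback decode always succeeds but may still produce the wrong characters if the file uses some other encoding.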
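The credential and retry notes above can be sketched together. The environment variable names come from the description; everything else (the function names `load_credentials` and `with_retries`, the exponential backoff schedule, and the set of exceptions treated as retryable) is an assumption for illustration, not the script's actual logic.

```python
import os
import time


def load_credentials(env=None) -> dict:
    """Read API keys from the environment (or from a mapping, for testing)."""
    env = os.environ if env is None else env
    api_key = env.get("EXTRACTION_API_KEY")
    if not api_key:
        raise RuntimeError("EXTRACTION_API_KEY is not set")
    # The MinerU key is optional, so a missing value is returned as None.
    return {
        "extraction_api_key": api_key,
        "mineru_api_key": env.get("MINERU_API_KEY"),
    }


def with_retries(func, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on network-style errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
```

Wrapping each API call as `with_retries(lambda: call_api(chunk))`, where `call_api` stands in for whatever client function is used, keeps one transient failure from aborting a whole batch.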
Repository Stats
- Stars: 0
- Forks: 0
- Open Issues: 0
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 08:25 PM