ebook-extractor

Extract plain text from EPUB, MOBI, and PDF files for analysis or processing. All three formats are handled locally.

Introduction

The ebook-extractor skill is a local-first tool for converting ebooks into plain text. It is aimed at users who need to process digital libraries, conduct research, or prepare content for analysis by other AI agents, and it hides the details of parsing each file format. Extraction is done by specialized Python libraries rather than by an LLM, so it consumes no tokens and needs no network access, which keeps data private and keeps local workflows fast.

  • Automated format detection for EPUB, MOBI, and PDF files.

  • Utilizes robust libraries such as ebooklib and BeautifulSoup for structured EPUB parsing.

  • Integrates with Calibre's ebook-convert CLI to handle proprietary MOBI conversion requirements.

  • Employs PyMuPDF (fitz) for high-performance PDF text extraction.

  • Provides both a unified interface for batch processing and granular scripts for format-specific debugging.

  • Designed for command-line integration, allowing piped input and output to text files or standard streams.

  • Run the included setup.sh script before first use; it installs the required dependencies.

  • Some PDFs are image-based or consist of scanned pages. The tool does not perform OCR, so it returns little or no text for such files.

  • MOBI support requires Calibre to be installed on the host system, since conversion relies on its ebook-convert CLI.

  • The tool is best suited for research-oriented tasks where plain text extraction is the primary goal, such as indexing documents, content auditing, or feeding raw text into RAG pipelines for further AI reasoning.
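
The detection and dispatch flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names (detect_format, extract_epub, extract_mobi, extract_pdf, extract_text) are hypothetical, and the sketch assumes ebooklib, beautifulsoup4, and PyMuPDF are installed and that Calibre's ebook-convert is on the PATH. The third-party imports are deferred into each extractor so that format detection works even when a given library is missing.

```python
import subprocess
import tempfile
import zipfile
from pathlib import Path

# All function names below are hypothetical; the skill's real scripts may differ.


def detect_format(path):
    """Guess the format from file signatures, falling back to the extension."""
    path = Path(path)
    with open(path, "rb") as fh:
        head = fh.read(68)
    if head.startswith(b"%PDF-"):
        return "pdf"
    # MOBI files are PalmDB containers; the type/creator "BOOKMOBI" is at offset 60.
    if head[60:68] == b"BOOKMOBI":
        return "mobi"
    # EPUB is a ZIP archive whose "mimetype" entry reads application/epub+zip.
    if head.startswith(b"PK") and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            if "mimetype" in zf.namelist():
                if zf.read("mimetype").strip() == b"application/epub+zip":
                    return "epub"
    suffix = path.suffix.lower().lstrip(".")
    return suffix if suffix in {"epub", "mobi", "pdf"} else None


def extract_epub(path):
    """Collect the text of every document item using ebooklib and BeautifulSoup."""
    import ebooklib
    from bs4 import BeautifulSoup
    from ebooklib import epub

    book = epub.read_epub(str(path))
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        parts.append(soup.get_text(separator="\n", strip=True))
    return "\n\n".join(parts)


def extract_mobi(path):
    """Convert to a temporary .txt file with Calibre's ebook-convert, then read it."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "book.txt"
        subprocess.run(
            ["ebook-convert", str(path), str(out)],
            check=True,
            capture_output=True,
        )
        return out.read_text(encoding="utf-8", errors="replace")


def extract_pdf(path):
    """Concatenate page text with PyMuPDF; scanned pages yield empty strings."""
    import fitz  # PyMuPDF

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


EXTRACTORS = {"epub": extract_epub, "mobi": extract_mobi, "pdf": extract_pdf}


def extract_text(path):
    """Detect the format and hand the file to the matching extractor."""
    fmt = detect_format(path)
    if fmt is None:
        raise ValueError(f"unsupported or unrecognized format: {path}")
    return EXTRACTORS[fmt](path)
```

Calling extract_text("book.epub") would return the book's text as a single string, which a caller can print to standard output or write to a file. Checking file signatures before the extension is a design choice of this sketch; it lets a misnamed file still reach the right extractor.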

Repository Stats

Stars: 36
Forks: 7
Open Issues: 4
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: May 1, 2026, 09:56 AM