Research

split-pdf

Automated pipeline to download, split, and deeply analyze academic PDFs in structured batches to avoid context window limits and ensure high-quality comprehension.

Introduction

The split-pdf skill is a specialized research tool designed to overcome the limitations of large language models when dealing with long-form academic documents. By automatically fragmenting extensive PDFs into manageable four-page chunks, it enables an iterative, deep-reading process that systematically builds comprehensive structured notes. This tool is ideal for researchers, students, and analysts who need to process papers, book chapters, or long technical reports without encountering unrecoverable context window errors or shallow, hallucinated summaries.

  • Automatically acquires academic papers from local file paths or web search queries using WebSearch and WebFetch tools.

  • Implements a rigorous splitting protocol using PyPDF2 to create four-page chunks, stored in dedicated, organized build directories to prevent source material modification.

  • Enforces a pause-and-confirm interaction model: the agent reads exactly three splits (up to 12 pages) per cycle, then waits for confirmation before continuing, to maintain focus and accuracy.

  • Performs structured information extraction, focusing on research questions, target audiences, methodology, and key contributions, synthesized into a persistent notes.md file.

  • Provides intelligent state management, checking for existing extractions or split files before re-processing to save time and token costs.

  • Always provide either a specific file path or a precise search query (title, author, year) to initialize the process.

  • The original PDF is never modified; all processing occurs on derivative split files, which preserves the integrity of your document library.

  • If an existing extract (basename_text.md) is found, the agent will prompt you to use it rather than re-reading the PDF from scratch.

  • The workflow is strictly sequential: retrieve, split, read in batches, update notes, and confirm before continuing to the next 12-page block.

  • Ensure the environment has access to PyPDF2 for the splitting operations; the agent will attempt to install it if missing.
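
The splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names, the build directory argument, and the output file naming are all assumptions made for the example. Only the four-page chunk size and the use of PyPDF2 come from the description. The sketch assumes PyPDF2 2.x or later, which provides PdfReader and PdfWriter.

```python
from pathlib import Path

CHUNK_PAGES = 4  # chunk size stated in the description


def chunk_ranges(total_pages, chunk_pages=CHUNK_PAGES):
    """Return (start, end) page-index pairs, end exclusive, covering every page."""
    return [
        (start, min(start + chunk_pages, total_pages))
        for start in range(0, total_pages, chunk_pages)
    ]


def split_pdf(source, build_dir):
    """Write four-page chunks of `source` into `build_dir`; the source is only read.

    The file naming scheme below is hypothetical.
    """
    # Imported here so chunk_ranges() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    source = Path(source)
    build_dir = Path(build_dir)
    build_dir.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(str(source))
    outputs = []
    for start, end in chunk_ranges(len(reader.pages)):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Page numbers in the file name are 1-based and inclusive.
        out_path = build_dir / f"{source.stem}_pp{start + 1}-{end}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        outputs.append(out_path)
    return outputs
```

The final chunk is simply shorter when the page count is not a multiple of four, so a 10-page paper would yield chunks of 4, 4, and 2 pages.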

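The batch-and-confirm cycle and the check for an existing extract might look like the sketch below. All names here are hypothetical, and the assumption that the basename_text.md extract sits next to the PDF is made only for illustration; the description does not say where it is stored. The read_batch and confirm callbacks stand in for whatever the agent actually does to read splits and ask the user.

```python
from pathlib import Path

SPLITS_PER_CYCLE = 3  # three splits per cycle, as stated in the description


def batches(splits, size=SPLITS_PER_CYCLE):
    """Group split files into consecutive batches of at most `size` items."""
    return [splits[i:i + size] for i in range(0, len(splits), size)]


def existing_extract(pdf_path):
    """Return the path of a prior extract if one exists, else None.

    Assumption: the extract lives beside the PDF as <basename>_text.md.
    """
    pdf_path = Path(pdf_path)
    candidate = pdf_path.with_name(f"{pdf_path.stem}_text.md")
    return candidate if candidate.exists() else None


def read_in_cycles(splits, read_batch, confirm):
    """Read one batch, then ask before continuing; return how many splits were read."""
    done = 0
    for batch in batches(splits):
        read_batch(batch)  # e.g. read the splits and update notes.md
        done += len(batch)
        # Only ask when there is more left to read.
        if done < len(splits) and not confirm(done, len(splits)):
            break
    return done
```

Passing the reading and confirmation steps in as callbacks keeps the sequencing logic separate from how the agent reads files or prompts the user.
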
Repository Stats

Stars: 332
Forks: 124
Open Issues: 1
Language: TeX
Default Branch: main
Sync Status: Idle
Last Synced: May 3, 2026, 05:23 AM