docs-scraper
Automated CLI tool to scrape and convert Notion, DocSend, PDFs, and general web documents into local PDF files with session persistence and daemon support.
Introduction
The docs-scraper is a robust CLI automation agent designed for researchers, developers, and knowledge managers who need to archive web content reliably. It specializes in capturing complex, protected, or dynamic document formats as standardized, locally stored PDF files. By running browser automation through a background daemon, it keeps session profiles active, allowing seamless scraping of authenticated sources without repeated logins. Whether handling internal Notion wikis, DocSend investor documents, or generic webpages requiring LLM-driven interaction, this tool provides a unified interface for document acquisition.
- Multi-source support: native handling for Notion, DocSend, and direct PDF links, with intelligent fallback via the Claude API for generic web documents.
- Session persistence: named profiles store cookies and authentication state, ensuring consistent access to gated content.
- Browser daemon: an integrated daemon keeps browser instances warm for faster job execution and performs automated file cleanup for storage management.
- Dynamic data input: per-scraper data fields such as emails, passwords, and names handle varied login flows, including NDA-gated portals.
- Job management: a CLI interface for monitoring blocked jobs, retrying failed scrapes, and managing local output paths.
- LLM fallback: leverages Claude to dynamically analyze page structures, interpret login requirements, and bypass obstacles like cookie banners or popups.
- Intended users: professionals gathering competitive intelligence, developers backing up documentation, and researchers managing large libraries of web-based resources.
- Constraints: requires Node.js and basic CLI proficiency; authentication-heavy tasks require an active ANTHROPIC_API_KEY for the LLM fallback engine.
- Practical usage: use the 'scrape' command with the '-p' profile flag to maintain session continuity; monitor blocked jobs with 'jobs list' and resolve authentication challenges with the 'update' command and specific form fields.
- Data flow: accepts target URLs as input, processes them via local headless browser automation, and writes PDF files to the ~/.docs-scraper/output/ directory.
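The practical-usage flow above might look like the following shell session. The subcommand names and the '-p' flag ('scrape', 'jobs list', 'update') are taken from this document; the binary name, profile name, job ID placeholder, and field flags are illustrative assumptions, not documented syntax:

```shell
# Scrape a DocSend document with a named profile so cookies and
# auth state persist between runs (profile name "work" is illustrative).
docs-scraper scrape https://docsend.com/view/example -p work

# Check for jobs blocked on authentication or NDA gates.
docs-scraper jobs list

# Supply the form fields a blocked job is waiting on
# (the --email/--passcode flag names are assumptions).
docs-scraper update <job-id> --email analyst@example.com --passcode 1234
```

On success, the resulting PDF lands under ~/.docs-scraper/output/ as described in the data-flow note above.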
Repository Stats
- Stars: 4,454
- Forks: 1,215
- Open Issues: 7
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: Apr 30, 2026, 10:41 AM