
crawl

Crawl websites to extract content as clean markdown files. Ideal for documentation, research, and offline knowledge management.

Introduction

The Crawl skill is a robust web scraping agent designed for documentation gathering, knowledge base construction, and in-depth content analysis. Built on the Tavily API, it lets agents intelligently navigate websites, follow links, and extract semantic content, converting complex web layouts into clean, actionable markdown. It is engineered for researchers and engineers who need structured data from online sources without the overhead of building manual scrapers.

Whether you are archiving technical documentation, analyzing market trends, or preparing datasets for retrieval-augmented generation (RAG), the skill gives you control over crawl depth, breadth, and path filtering to keep extraction efficient. You can choose between full-page archiving and context-optimized chunking to manage token usage within an LLM conversation. Authentication works through either OAuth, for seamless integration, or an API key, for server-side stability. The skill is built for scale: recursive crawls support configurable depth and regex path patterns that exclude irrelevant content such as logs, blog clutter, or administrative pages.
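The typical flow is a single request to Tavily's crawl endpoint carrying the start URL and the crawl controls described above. The sketch below is an illustration only: the endpoint path, the parameter names (max_depth, instructions, select_paths, format), and the authentication header are assumptions based on this description rather than a verified API contract.

```python
# Minimal sketch of one crawl request. The endpoint path and parameter names
# are assumptions drawn from the description above; confirm them against the
# Tavily API reference before relying on them.
import os
import requests

def crawl_docs(start_url: str, max_depth: int = 1) -> dict:
    """Crawl start_url to a shallow depth and return the parsed JSON response."""
    payload = {
        "url": start_url,
        "max_depth": max_depth,                     # assumed: 1-5 levels supported
        "instructions": "Focus on API reference and installation pages",
        "select_paths": [r"/docs/.*"],              # assumed: regex path filter
        "format": "markdown",                       # assumed: markdown output mode
    }
    headers = {"Authorization": f"Bearer {os.environ['TAVILY_API_KEY']}"}
    resp = requests.post("https://api.tavily.com/crawl",  # assumed endpoint path
                         json=payload, headers=headers, timeout=120)
    resp.raise_for_status()
    return resp.json()
```

A call such as `crawl_docs("https://example.com/docs")` keeps the crawl shallow (max_depth=1) for an initial pass, in line with the usage tips further down.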

Key Features

  • Advanced web content extraction via Tavily API with support for markdown and text formatting.

  • Recursive crawling capabilities with configurable depth (1-5 levels) and breadth limits.

  • Regex-based path filtering to precisely target documentation, API references, or specific sections.

  • Context-aware chunking mode designed for agentic research to fit content within LLM token windows.

  • OAuth and API Key support for secure and flexible deployment in any development environment.

  • Automated file output for archiving documentation, saving pages as local markdown files (see the file-saving sketch after this list).

  • High performance for data collection tasks, including site-wide scraping for offline analysis.
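
As a rough illustration of the automated file output feature above, the helper below writes each crawled page to a local markdown file. The response shape it assumes (a list of results, each carrying url and raw_content fields) is an illustrative guess, not the documented schema.

```python
# Sketch only: assumes each crawl result exposes "url" and "raw_content"
# fields, which is an illustrative guess at the response schema.
import re
from pathlib import Path

def save_as_markdown(results: list[dict], output_dir: str) -> list[Path]:
    """Write each crawled page to <output_dir>/<slug>.md and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for page in results:
        # Derive a filesystem-safe file name from the page URL.
        slug = re.sub(r"[^a-zA-Z0-9]+", "-", page["url"]).strip("-") or "index"
        path = out / f"{slug}.md"
        path.write_text(page.get("raw_content", ""), encoding="utf-8")
        written.append(path)
    return written
```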

Usage Tips

  • Start with max_depth=1 for initial exploration; use regex patterns to avoid circular link structures or infinite loops.

  • Use the instructions parameter to focus the crawler on specific sections like API documentation or installation guides.

  • When collecting data for local LLM storage or RAG, prefer chunks_per_source to maintain relevant context while saving tokens.

  • Ensure the Tavily API key or OAuth session is valid; use the provided bash script to quickly test connectivity (a rough Python equivalent is sketched after this list).

  • The output_dir argument is mandatory for bulk local file storage; otherwise, results are returned as raw JSON objects.

  • Respect the crawling limits: max_depth=3 and higher should be used cautiously on large documentation sets to avoid excessive wait times.
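
The connectivity check mentioned in the tips ships with the skill as a bash script; purely as an illustration, a roughly equivalent Python check might look like the following, where the endpoint and the TAVILY_API_KEY variable name are assumptions rather than the script's actual contents.

```python
# Illustrative stand-in for the skill's bash connectivity test. The endpoint
# and the TAVILY_API_KEY variable name are assumptions, not the real script.
import os
import sys
import requests

def check_tavily_connectivity() -> bool:
    """Send a tiny search request and report whether the API key is accepted."""
    api_key = os.environ.get("TAVILY_API_KEY")
    if not api_key:
        print("TAVILY_API_KEY is not set", file=sys.stderr)
        return False
    resp = requests.post(
        "https://api.tavily.com/search",            # assumed endpoint path
        json={"query": "connectivity check", "max_results": 1},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    print(f"Tavily responded with HTTP {resp.status_code}")
    return resp.ok

if __name__ == "__main__":
    sys.exit(0 if check_tavily_connectivity() else 1)
```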

Repository Stats

Stars: 4,454
Forks: 1,215
Open Issues: 7
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: Apr 30, 2026, 11:11 AM