
literature-engineer

Evidence-first literature collector for automated research pipelines. Scales paper pools to 1200+ with metadata normalization, provenance tracking, and multi-source ingestion.

Introduction

The Literature Engineer is a specialized skill designed for the 'evidence-first' stage of academic and technical research pipelines. It focuses on constructing large-scale, verifiable candidate pools (1200+ papers) essential for downstream tasks like deduplication, ranking, citation generation, and evidence synthesis. By automating the ingestion of diverse sources, it mitigates the common 'empty pool' problem that causes failure in downstream drafting and mapping phases.

This skill is built for robustness, treating data provenance as a primary requirement. Every record processed is normalized to include stable identifiers (arXiv IDs, DOIs, or trusted URLs) and detailed source tracking. It integrates seamlessly with domain-specific packs (e.g., for LLM agents) to ensure essential foundation papers are always included even if keyword searches fluctuate in quality. It is intended for researchers, technical writers, and survey automation agents who require repeatable, auditable literature stacks.
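
As an illustration of this contract, the sketch below shows one plausible shape for a normalized, provenance-tagged record and how it could be appended to the papers_raw.jsonl output described later. This is a minimal sketch: the field names (title, authors, year, ids, provenance) and helper functions are illustrative assumptions, not the skill's documented schema.

```python
import json
import re

def normalize_record(raw: dict, source: str) -> dict:
    """Canonicalize one heterogeneous input row into a unified record.

    Field names here (title, authors, year, ids, provenance) are
    illustrative assumptions, not the skill's documented schema.
    """
    title = re.sub(r"\s+", " ", raw.get("title", "")).strip()
    return {
        "title": title,
        "authors": [a.strip() for a in raw.get("authors", []) if a.strip()],
        "year": raw.get("year"),
        # Stable identifiers: keep whichever the source provides.
        "ids": {
            "arxiv": raw.get("arxiv_id"),
            "doi": raw.get("doi"),
            "url": raw.get("url"),
        },
        # Provenance-first: every record carries its exact origin.
        "provenance": {"source": source},
    }

def append_jsonl(path: str, records: list[dict]) -> None:
    """Append normalized records to a JSONL file, one object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    rows = [{"title": "An  Example\nPaper", "authors": ["A. Author "],
             "year": 2024, "arxiv_id": "2401.00001"}]
    append_jsonl("papers_raw.jsonl",
                 [normalize_record(r, source="arxiv_api") for r in rows])
```

With records shaped this way, downstream deduplication and ranking stages can filter or audit entries by provenance without re-querying the original sources.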

  • Multi-route data ingestion: Aggregates from local bib/jsonl/csv files, offline exports, and online API retrievals (arXiv/Semantic Scholar).

  • Provenance-first architecture: Tags every entry with its exact origin, ensuring transparency and filterability for downstream analysis.

  • Snowballing and expansion: Supports iterative expansion through reference/cited-by graphs to improve topic coverage (a minimal one-hop sketch follows this list).

  • Metadata normalization: Cleanses and canonicalizes heterogeneous inputs into a unified papers_raw.jsonl format.

  • Network-resilient design: Operates in offline-only or hybrid modes, using proxies (e.g., jina.ai) to ensure pipeline continuity in restricted environments.

  • Usage: Runs primarily as Stage C1 in the survey pipeline to ensure sufficient evidence depth before prose generation.

  • Input requirements: Requires a queries.md file for configuration; supports raw exports in papers/imports/.

  • Outputs: Generates papers_raw.jsonl (primary data), papers_raw.csv (human review), and a retrieval_report.md (coverage statistics).

  • Guardrails: Explicitly prevents paper fabrication and blocks execution when target thresholds are not met, ensuring a high-quality evidence stack (see the threshold-check sketch after this list).
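
The snowballing item above refers to the following one-hop expansion sketch. It assumes the public Semantic Scholar Graph API references endpoint (/graph/v1/paper/{id}/references) and the requests library; the endpoint shape, field names, and seed identifier format should be verified against the current API documentation rather than taken as the skill's actual implementation.

```python
import requests

S2_REFS = "https://api.semanticscholar.org/graph/v1/paper/{pid}/references"

def snowball_once(paper_id: str, limit: int = 100) -> list[dict]:
    """Fetch one hop of the reference graph for a single seed paper."""
    resp = requests.get(
        S2_REFS.format(pid=paper_id),
        params={"fields": "title,year,externalIds", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    records = []
    for item in resp.json().get("data", []):
        cited = item.get("citedPaper") or {}
        records.append({
            "title": cited.get("title"),
            "year": cited.get("year"),
            "ids": cited.get("externalIds") or {},
            # Tag the hop so downstream stages can trace how it entered the pool.
            "provenance": {"source": "s2_snowball", "seed": paper_id},
        })
    return records

if __name__ == "__main__":
    # Seed identifier format accepted by the API (e.g., "arXiv:<id>").
    for ref in snowball_once("arXiv:1706.03762", limit=5):
        print(ref["title"])
```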

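The guardrail bullet can be read as a simple pool-size check before hand-off: count the records already written to papers_raw.jsonl and refuse to continue when the pool is too small. The sketch below assumes the 1200-record target and file path from the description above; in practice these would come from queries.md or the pipeline configuration.

```python
import sys

TARGET_POOL_SIZE = 1200  # assumed target; configure via queries.md / pipeline settings

def pool_size(path: str = "papers_raw.jsonl") -> int:
    """Count non-empty lines (one JSON record per line) in the raw pool."""
    try:
        with open(path, encoding="utf-8") as fh:
            return sum(1 for line in fh if line.strip())
    except FileNotFoundError:
        return 0

def enforce_threshold(path: str = "papers_raw.jsonl") -> None:
    """Block downstream stages when the candidate pool is below target."""
    size = pool_size(path)
    if size < TARGET_POOL_SIZE:
        sys.exit(f"Guardrail: pool has {size} records, target is "
                 f"{TARGET_POOL_SIZE}; refusing to continue to ranking/drafting.")
    print(f"Pool check passed: {size} records.")

if __name__ == "__main__":
    enforce_threshold()
```
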
Repository Stats

Stars: 422
Forks: 29
Open Issues: 0
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: Apr 29, 2026, 02:14 PM