paper-reproduce

Systematic methodology for reproducing published academic papers using provided data, including sample selection, statistical verification, and automated reporting.

Introduction

The paper-reproduce skill provides a rigorous, step-by-step framework for replicating academic research findings. Designed for researchers, data scientists, and students, it guides users through the entire reproduction pipeline, ensuring that statistical results, variable definitions, and sample structures match the source publication. By leveraging automated data exploration and validation techniques, the skill minimizes the 'black box' nature of empirical research, allowing users to identify and document deviations between harmonized datasets and original studies.

  • Full pipeline support: From initial data exploration and variable mapping to final statistical analysis and result comparison.

  • Rigorous variable identification: Includes multi-step techniques like semantic searching, range validation, and cross-variable summation to reconcile harmonized survey data with the paper's variable definitions.

  • Systematic sample filtering: Automates the exclusion process to ensure population sizes match the target paper, with built-in logging for transparency.

  • Advanced statistical analysis: Supports OLS, robust standard errors (HC3), interaction terms, and stratified analyses, providing standardized beta coefficients for comparability.

  • Automated documentation: Generates professional-grade output in Markdown and LaTeX/PDF formats, including a formatted Table 1 (descriptive statistics) and regression tables.

  • Prioritize data verification: Always validate variable ranges and means before building models; do not assume variable mappings are correct until verified against paper descriptions.

  • Handle deviations: Differences between harmonized data and original datasets are expected; use the provided three-tier status labeling (Validated, Consistent Trend, Non-replicated) to document these gaps.

  • Input requirements: Expects raw data files (Stata .dta, CSV, or SAS) and the reference paper (PDF or methodology text).

  • Output structure: Automatically organizes scripts, analysis logs, and generated reports into a standardized directory structure, ensuring reproducibility of the reproduction process itself.

  • Statistical nuances: Be mindful of standard error estimation, categorical variable encoding, and sub-sample standardization protocols to avoid systematic bias.
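
The filtering, range-validation, and status-labeling ideas in the list above can be sketched in a few lines of plain Python. This is only an illustration, not the skill's actual implementation: the function names (`apply_exclusions`, `check_range`, `classify_replication`), the toy records, and especially the 10% tolerance used to separate "Validated" from "Consistent Trend" are assumptions made for this example. The skill's real thresholds are not stated here.

```python
from typing import Callable, Optional

Row = dict  # one survey record, e.g. {"id": 1, "age": 34, "income": 52000}


def apply_exclusions(rows: list, steps: list) -> tuple:
    """Apply exclusion criteria in order, logging the sample size after each step."""
    log = [("initial sample", len(rows))]
    current = list(rows)
    for label, keep in steps:
        current = [r for r in current if keep(r)]
        log.append((label, len(current)))
    return current, log


def check_range(rows: list, var: str, low: float, high: float) -> bool:
    """Return True if every non-missing value of `var` lies in [low, high]."""
    values = [r[var] for r in rows if r.get(var) is not None]
    return all(low <= v <= high for v in values)


def classify_replication(reproduced: float, published: float,
                         rel_tol: float = 0.10) -> str:
    """Label a reproduced estimate against the published one.

    Assumed rule (hypothetical, for illustration only):
      - opposite sign or a zero estimate        -> "Non-replicated"
      - same sign, within rel_tol of published  -> "Validated"
      - same sign, outside rel_tol              -> "Consistent Trend"
    """
    if reproduced * published <= 0:
        return "Non-replicated"
    if abs(reproduced - published) <= rel_tol * abs(published):
        return "Validated"
    return "Consistent Trend"


# Toy records standing in for a harmonized dataset (made-up values).
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 17, "income": 0},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 61, "income": 38000},
]

# Exclusion criteria as (label, keep-predicate) pairs, applied in order.
steps = [
    ("age >= 18", lambda r: r["age"] >= 18),
    ("income not missing", lambda r: r["income"] is not None),
]

sample, log = apply_exclusions(rows, steps)

if __name__ == "__main__":
    for label, n in log:
        print(f"{label}: n = {n}")
    print("age in [18, 100]:", check_range(sample, "age", 18, 100))
    print(classify_replication(0.21, 0.20))
    print(classify_replication(0.12, 0.20))
    print(classify_replication(-0.05, 0.20))
```

Keeping the exclusion log as an ordered list of (label, n) pairs makes it easy to compare each step against the sample-flow numbers reported in the target paper and to see where the counts first diverge.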

Repository Stats

Stars: 703
Forks: 194
Open Issues: 6
Language: TeX
Default Branch: main
Sync Status: Idle
Last Synced: May 1, 2026, 07:29 AM