book-sft-pipeline

An end-to-end pipeline for converting books into fine-tuning datasets and training style-transfer models for author voice replication.

Introduction

The book-sft-pipeline is a specialized skill for developers and researchers who want to build high-quality synthetic datasets from literary works and train small language models (8B parameters or fewer) to capture a specific authorial voice. It provides a structured framework for the entire fine-tuning lifecycle, from raw ePub extraction to model validation, and is built on context-engineering principles: semantic data segmentation and diverse instruction generation to prevent overfitting and ensure high-fidelity style transfer.

  • Intelligent text extraction from ePub files, parsing at the paragraph level and stripping metadata such as copyright pages and tables of contents to ensure clean training data (see the extraction sketch after this list).

  • Text segmentation strategies that respect semantic boundaries and use word-count-based chunking (150-400 words) with overlap to preserve stylistic continuity and rhythm (see the chunking sketch below).

  • Diverse instruction generation via configurable system prompts and templates, so the model learns the author's rhythm, vocabulary, and prose patterns rather than memorizing plot points (see the instruction-generation sketch below).

  • Integration with training platforms such as Tinker, providing a pathway to LoRA (Low-Rank Adaptation) fine-tuning optimized for style transfer and author voice replication (a generic LoRA setup is sketched below).

  • Built-in validation strategies, including AI-detector benchmarking and loss-trajectory monitoring, to confirm the model captures the desired stylistic essence (see the monitoring sketch below).

  • Users should activate this skill when working with literary datasets, performing style transfer, building SFT (Supervised Fine-Tuning) datasets, or designing segmentation pipelines for long-form content.

  • Inputs are raw ePub documents; outputs are cleaned JSONL datasets compatible with standard training frameworks, plus LoRA adapters for target models such as Qwen.

  • For optimal results, use at least 15 prompt templates and 5 distinct system prompts during data generation to encourage linguistic diversity.

  • Note that this skill is optimized for capturing 'style over content,' meaning it is not intended for knowledge retrieval tasks but rather for creative prose and voice emulation.
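
The sketches below illustrate the main pipeline stages. Each is a minimal, illustrative example, not the skill's exact implementation. First, paragraph-level ePub extraction: this version relies on ebooklib and BeautifulSoup, and the front-matter keyword filter is an assumed heuristic rather than the skill's actual rule set.

```python
# Minimal extraction sketch; requires `pip install ebooklib beautifulsoup4`.
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

# Assumed heuristic markers for front/back matter -- tune per book.
FRONT_MATTER_MARKERS = ("copyright", "all rights reserved", "table of contents")

def extract_paragraphs(epub_path: str) -> list[str]:
    """Return cleaned paragraph texts from an ePub, skipping likely metadata."""
    book = epub.read_epub(epub_path)
    paragraphs = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        for p in soup.find_all("p"):
            text = p.get_text(" ", strip=True)
            if not text:
                continue
            # Drop boilerplate such as copyright pages and TOC entries.
            if any(m in text.lower() for m in FRONT_MATTER_MARKERS):
                continue
            paragraphs.append(text)
    return paragraphs
```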
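
Next, word-count-based chunking. This sketch packs whole paragraphs into chunks of roughly 150-400 words and re-seeds each new chunk with the tail of the previous one; the one-paragraph overlap and the decision to drop an undersized final chunk are assumed defaults.

```python
def chunk_paragraphs(paragraphs, min_words=150, max_words=400, overlap=1):
    """Pack whole paragraphs into ~150-400 word chunks, carrying the last
    `overlap` paragraphs into the next chunk to preserve rhythm."""
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count >= min_words and count + words > max_words:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]            # carry overlap forward
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += words
    if current and count >= min_words:              # skip an undersized tail
        chunks.append("\n\n".join(current))
    return chunks
```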
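
For instruction generation, one plausible layout is chat-style JSONL in which each chunk is paired with a randomly chosen system prompt and template. The schema, the prompt texts, and the per-chunk `topic` field are all illustrative assumptions; in practice the skill recommends at least 5 system prompts and 15 templates.

```python
import json
import random

SYSTEM_PROMPTS = [  # illustrative; use at least 5 distinct prompts
    "You are a novelist with a distinctive literary voice.",
    "You write long-form fiction in a consistent personal style.",
]
TEMPLATES = [  # illustrative; use at least 15 varied templates
    "Write a passage of literary prose about {topic}.",
    "Continue a scene in your own narrative voice, touching on {topic}.",
]

def build_records(chunks, topics, out_path="train.jsonl", seed=0):
    """Pair each chunk with a random (system prompt, template) so the model
    sees varied instructions instead of one fixed prompt."""
    rng = random.Random(seed)
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk, topic in zip(chunks, topics):
            record = {"messages": [
                {"role": "system", "content": rng.choice(SYSTEM_PROMPTS)},
                {"role": "user", "content": rng.choice(TEMPLATES).format(topic=topic)},
                {"role": "assistant", "content": chunk},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```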
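
The skill itself targets Tinker for training; as a library-agnostic stand-in, this sketch shows an equivalent LoRA setup with Hugging Face peft. The model ID and hyperparameters are illustrative, not values prescribed by the skill.

```python
# Generic LoRA setup with Hugging Face peft (stand-in for the Tinker flow).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_config = LoraConfig(
    r=16,                    # low-rank dimension; modest ranks suit style transfer
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only adapter weights should be trainable
```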
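
Finally, loss-trajectory monitoring can be as simple as comparing windowed averages to catch spikes (possible divergence) and premature plateaus; AI-detector benchmarking depends on external tooling and is not shown. The thresholds below are illustrative starting points, not tuned values.

```python
def check_loss_trajectory(losses, window=20, plateau_tol=0.005, spike_factor=1.5):
    """Flag spikes and premature plateaus in a per-step training loss series."""
    alerts = []
    if len(losses) >= 2 * window:
        recent = sum(losses[-window:]) / window
        prior = sum(losses[-2 * window:-window]) / window
        if recent > prior * spike_factor:
            alerts.append(f"loss spiked: {prior:.3f} -> {recent:.3f}")
        elif prior - recent < plateau_tol:
            alerts.append(f"loss plateaued near {recent:.3f}")
    return alerts
```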

Repository Stats

Stars: 15,323
Forks: 1,203
Open Issues: 25
Language: Python
Default Branch: main