
training-data-curation

Guidelines for curating high-quality datasets for LLM post-training (SFT/DPO/RLHF), covering data formats, quality filtering, and collection strategies.

Introduction

The training-data-curation skill provides a comprehensive framework for assembling, cleaning, and formatting datasets used in Large Language Model (LLM) post-training. Whether you are performing Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), or Reinforcement Learning from Human Feedback (RLHF), this skill serves as a reference for building datasets that drive model performance. It is designed for machine learning engineers, data scientists, and AI researchers tasked with improving model alignment, reducing bias, and ensuring training data adheres to professional standards. By emphasizing the "quality over quantity" paradigm, the skill helps you avoid common pitfalls such as data poisoning, formatting inconsistencies, and dataset noise.

Key Features

  • Standardized formatting protocols for SFT using message-based JSONL structures, facilitating compatibility with standard trainers.

  • Structured guidelines for preference learning, including pairing techniques for DPO, ORPO, and KTO, as well as ranking strategies for RLHF.

  • Automated quality control checklists covering duplicate removal, PII identification, boilerplate filtering, and manual inspection workflows.

  • Technical heuristics for dataset health, including n-gram repetition analysis, alphabetic-to-total character ratio checks, and language identification using fastText.

  • Sizing benchmarks ranging from experimental datasets (~100 samples) to large-scale instruction tuning (100,000+ samples) and massive RLHF preference pools.

Best Practices

  • Always validate data schemas against target formats, such as the OpenAI or Tinker API message specifications, before initiating training runs.

  • Use efficient serialization formats such as Parquet for large-scale datasets to reduce I/O bottlenecks and enable compression.

  • Implement MinHash deduplication to eliminate near-duplicate entries, which is critical for preventing overfitting and maintaining data diversity.

  • Pay close attention to license provenance and ethical data collection to comply with the Data Provenance Initiative standards.

  • Treat synthetic data as a secondary source; prioritize human-annotated examples for high-stakes fine-tuning tasks to ensure model reliability and factual accuracy.
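The message-based JSONL layout and the schema-validation advice above can be sketched together. This is an illustrative check, not the skill's own validator; the role/content keys follow the OpenAI-style chat schema, and the allowed-role set and the "final turn must be an assistant message" rule are reasonable conventions, not requirements of any particular trainer.

```python
import json

# Hypothetical record illustrating the message-based JSONL schema.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does JSONL mean?"},
        {"role": "assistant", "content": "JSON Lines: one JSON object per line."},
    ]
}

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def validate_sft_record(rec: dict) -> bool:
    """Return True if the record matches the expected message schema."""
    msgs = rec.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if not isinstance(m, dict):
            return False
        if m.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    # Train on completions: require the final turn to be an assistant message.
    return msgs[-1]["role"] == "assistant"

# Round-trip through JSONL serialization before checking.
line = json.dumps(record, ensure_ascii=False)
print(validate_sft_record(json.loads(line)))  # prints True
```

Running a check like this over every line before training catches malformed records early, when they are cheap to fix.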
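For preference learning (DPO, ORPO, KTO), each training example pairs one preferred and one rejected completion for the same prompt. A minimal sketch follows; the `prompt`/`chosen`/`rejected` field names are a common convention but vary by trainer, so treat the exact layout as an assumption to verify against your framework's docs.

```python
import json

def make_dpo_record(prompt: str, chosen: str, rejected: str) -> str:
    """Serialize one preference pair as a JSONL line in a common
    prompt/chosen/rejected layout (field names vary by trainer)."""
    if chosen.strip() == rejected.strip():
        # A pair with no preference signal is useless for DPO-style losses.
        raise ValueError("chosen and rejected must differ")
    record = {
        "prompt": [{"role": "user", "content": prompt}],
        "chosen": [{"role": "assistant", "content": chosen}],
        "rejected": [{"role": "assistant", "content": rejected}],
    }
    return json.dumps(record, ensure_ascii=False)

line = make_dpo_record(
    "Summarize photosynthesis in one sentence.",
    "Plants convert light, water, and CO2 into glucose and oxygen.",
    "Photosynthesis is a thing plants do.",
)
```

Rejecting identical pairs at ingestion time is a cheap guard: degenerate pairs contribute zero gradient signal and inflate dataset size for nothing.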
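Two of the dataset-health heuristics mentioned above, n-gram repetition and alphabetic character ratio, fit in a few lines each. The implementations and any cutoff thresholds here are illustrative assumptions; tune them per corpus.

```python
def ngram_repetition(text: str, n: int = 3) -> float:
    """Fraction of duplicated word n-grams; high values flag degenerate,
    loopy text (e.g. a model stuck repeating a phrase)."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def alpha_ratio(text: str) -> float:
    """Share of alphabetic characters; low values flag markup,
    tables of numbers, or binary junk masquerading as prose."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

clean = "The quick brown fox jumps over the lazy dog near the river bank."
loopy = "the same phrase " * 10  # heavily repeated trigrams
```

For example, `ngram_repetition(loopy)` scores far above `ngram_repetition(clean)`, so a threshold filter separates the two cleanly. Language identification with fastText follows the same pattern but requires downloading a pretrained model, so it is omitted here.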
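The MinHash deduplication recommendation can be sketched without external dependencies. This is a toy stand-in for production libraries such as `datasketch`: word-trigram shingles, seeded MD5 hashes in place of true random permutations, and signature agreement as a Jaccard-similarity estimate.

```python
import hashlib
import re

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles used as the document's set representation."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return set(tokens)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash(sh: set, num_perm: int = 64) -> list:
    """One min-hash per seeded hash function (a toy permutation family)."""
    if not sh:
        return [0] * num_perm
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing signature components estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = "Machine learning models need clean and diverse training data."
b = "Machine learning models need clean and diverse training examples."
c = "Completely unrelated sentence about cooking pasta at home."
```

Here `a` and `b` differ by one word, so their estimated similarity is high and a near-duplicate filter would drop one of them, while `c` shares no shingles with either. At scale, signatures are bucketed with locality-sensitive hashing rather than compared pairwise.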
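The Parquet recommendation amounts to a one-line conversion once records are in a dataframe. A minimal sketch, assuming pandas with a Parquet engine (pyarrow or fastparquet) installed; the column names are illustrative.

```python
import tempfile

import pandas as pd  # requires a parquet engine such as pyarrow

rows = [
    {"prompt": "What is SFT?", "response": "Supervised fine-tuning.", "source": "manual"},
    {"prompt": "Define DPO.", "response": "Direct Preference Optimization.", "source": "manual"},
]
df = pd.DataFrame(rows)

with tempfile.TemporaryDirectory() as d:
    path = f"{d}/dataset.parquet"
    # Columnar storage plus compression cuts file size and speeds up scans
    # compared with row-by-row JSONL reads.
    df.to_parquet(path, compression="snappy")
    loaded = pd.read_parquet(path)
```

For multi-million-row datasets, Parquet's column pruning also lets quality-filter passes read only the fields they need.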

Repository Stats

  • Stars: 149
  • Forks: 8
  • Open Issues: 1
  • Language: Python
  • Default Branch: main