Data Analysis

data-cleaning-pipeline-generator

Generates data cleaning pipelines for pandas/polars/PySpark, handling missing values, duplicates, outliers, type conversions, and validation.

Introduction

The Data Cleaning Pipeline Generator is a specialized agent skill for data scientists, analysts, and engineers who need to automate the preprocessing of messy datasets. It generates robust Python pipelines that work with the major data processing libraries pandas, Polars, and PySpark, and moves data from a raw, noisy state to clean, analysis-ready structures by applying established data-quality practices.

  • Automatically detects and imputes missing values using strategies such as mean or median imputation, or custom fill values.

  • Provides efficient deduplication methods for multi-column subsets or full row matching.

  • Detects and removes outliers using the IQR (interquartile range) or Z-score method.

  • Performs automated data type casting, including intelligent date/time parsing and categorical encoding (label or one-hot encoding).

  • Implements text normalization features such as whitespace stripping and casing adjustments.

  • Supports validation rule definitions for numeric ranges to ensure data integrity before further analysis.

  • Users should trigger this skill when asking to clean datasets, remove duplicate entries, fix inconsistent data types, or handle null/NaN values.

  • The skill expects input in the form of tabular data (CSV, Parquet, or SQL exports) and provides modular, class-based Python code for easy integration into existing notebooks or batch scripts.

  • The generated code includes a logging utility to track the impact of each cleaning step, providing full transparency on how many rows were dropped or modified.

  • While ideal for standard pandas workflows, the logic is structured to be adaptable for larger datasets that need Polars' multithreaded engine or PySpark's distributed computing.

  • Ensure column names and data types are well-defined before invocation to allow the automatic detection features to function with maximum accuracy.
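As a rough sketch of the kind of class-based pandas pipeline the skill describes (the class name `CleaningPipeline`, method names, and the logging format here are illustrative assumptions, not the skill's actual output):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cleaning")


class CleaningPipeline:
    """Illustrative pandas cleaning pipeline with per-step row-count logging."""

    def __init__(self, numeric_bounds=None):
        # Validation rules: {column: (min_allowed, max_allowed)}
        self.numeric_bounds = numeric_bounds or {}

    def _log(self, step, before, after):
        # Track how many rows each step dropped, for transparency.
        logger.info("%s: %d -> %d rows (%d dropped)", step, before, after, before - after)

    def impute_missing(self, df, strategy="median"):
        # Fill NaNs in numeric columns with the column median (or mean).
        out = df.copy()
        for col in out.select_dtypes("number"):
            fill = out[col].median() if strategy == "median" else out[col].mean()
            out[col] = out[col].fillna(fill)
        return out

    def drop_duplicates(self, df, subset=None):
        # Deduplicate on a column subset, or on full rows when subset is None.
        before = len(df)
        out = df.drop_duplicates(subset=subset)
        self._log("drop_duplicates", before, len(out))
        return out

    def remove_outliers_iqr(self, df, col, k=1.5):
        # Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for the given column.
        before = len(df)
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
        self._log(f"outliers[{col}]", before, len(out))
        return out

    def normalize_text(self, df, cols):
        # Strip whitespace and normalize casing on text columns.
        out = df.copy()
        for col in cols:
            out[col] = out[col].str.strip().str.lower()
        return out

    def validate(self, df):
        # Raise if any value falls outside its configured numeric range.
        for col, (lo, hi) in self.numeric_bounds.items():
            bad = int((~df[col].between(lo, hi)).sum())
            if bad:
                raise ValueError(f"{bad} rows in {col!r} outside [{lo}, {hi}]")
        return df
```

A typical invocation chains the steps and finishes with validation:

```python
pipe = CleaningPipeline(numeric_bounds={"price": (0.0, 10_000.0)})
clean = (df.pipe(pipe.impute_missing)
           .pipe(pipe.drop_duplicates)
           .pipe(pipe.normalize_text, cols=["name"])
           .pipe(pipe.remove_outliers_iqr, col="price")
           .pipe(pipe.validate))
```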

Repository Stats

Stars
5
Forks
2
Open Issues
0
Language
TypeScript
Default Branch
main
Sync Status
Idle
Last Synced
May 3, 2026, 05:23 PM