data-cleaning-pipeline-generator
Generates data cleaning pipelines for pandas/polars/PySpark, handling missing values, duplicates, outliers, type conversions, and validation.
Introduction
The Data Cleaning Pipeline Generator is an agent skill for data scientists, analysts, and engineers who need to automate the preprocessing of messy datasets. It generates Python pipelines that work with pandas, polars, and PySpark, and takes raw, noisy data to clean, analysis-ready structures by applying common data-quality practices.
- Detects missing values and resolves them with statistical strategies such as mean or median imputation, or with custom placeholders.
- Deduplicates on a subset of columns or on full-row matches.
- Detects and removes outliers using the IQR (interquartile range) or Z-score method.
- Casts data types automatically, including date/time parsing and categorical encoding (label or one-hot).
- Normalizes text by stripping whitespace and adjusting case.
- Supports validation rules for numeric ranges to check data integrity before further analysis.
- Trigger this skill when asking to clean datasets, remove duplicate entries, fix inconsistent data types, or handle null/NaN values.
- The skill expects tabular input (CSV, Parquet, or SQL exports) and produces modular, class-based Python code that fits into existing notebooks or batch scripts.
- The generated code includes a logging utility that tracks the impact of each cleaning step, reporting how many rows were dropped or modified.
- While best suited to standard pandas workflows, the logic is structured so it can be adapted to Polars, or to PySpark for distributed processing of larger datasets.
- Make sure column names and data types are well defined before invocation so the automatic detection features work as accurately as possible.
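To make the features above concrete, here is a minimal pandas sketch of what a class-based pipeline with per-step logging might look like. It is an illustration under stated assumptions, not the skill's actual output: the class name `CleaningPipelineSketch`, its method names, the sample data, and the 0-120 age range are all hypothetical.

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning_sketch")


class CleaningPipelineSketch:
    """Hypothetical example; names and defaults are assumptions."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()  # never mutate the caller's frame

    def _log(self, step: str, rows_before: int) -> None:
        # Record the row count before and after each step.
        log.info("%s: %d -> %d rows", step, rows_before, len(self.df))

    def strip_text(self, column: str):
        before = len(self.df)
        # Trim surrounding whitespace and lowercase the text.
        self.df[column] = self.df[column].str.strip().str.lower()
        self._log(f"strip_text({column})", before)
        return self

    def fill_missing(self, column: str, strategy: str = "median"):
        before = len(self.df)
        series = self.df[column]
        value = series.mean() if strategy == "mean" else series.median()
        self.df[column] = series.fillna(value)
        self._log(f"fill_missing({column}, {strategy})", before)
        return self

    def drop_duplicates(self, subset=None):
        before = len(self.df)
        # subset=None compares full rows; a list restricts the comparison.
        self.df = self.df.drop_duplicates(subset=subset).reset_index(drop=True)
        self._log("drop_duplicates", before)
        return self

    def remove_outliers_iqr(self, column: str, k: float = 1.5):
        before = len(self.df)
        q1, q3 = self.df[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Keep rows inside [Q1 - k*IQR, Q3 + k*IQR], bounds inclusive.
        keep = self.df[column].between(q1 - k * iqr, q3 + k * iqr)
        self.df = self.df[keep].reset_index(drop=True)
        self._log(f"remove_outliers_iqr({column})", before)
        return self

    def validate_range(self, column: str, low: float, high: float):
        # Fail loudly if any value falls outside the allowed range.
        bad = ~self.df[column].between(low, high)
        if bad.any():
            raise ValueError(f"{int(bad.sum())} value(s) in {column!r} outside [{low}, {high}]")
        return self


# Made-up sample data for illustration only.
raw = pd.DataFrame(
    {
        "name": [" Ann ", "bob", "bob", "Cy", "Dee", "Eve"],
        "age": [30.0, 25.0, 25.0, np.nan, 28.0, 400.0],
    }
)

cleaned = (
    CleaningPipelineSketch(raw)
    .strip_text("name")
    .fill_missing("age", strategy="median")
    .drop_duplicates()
    .remove_outliers_iqr("age")
    .validate_range("age", 0, 120)
    .df
)
print(cleaned)
```

Each method returns `self` so steps can be chained, and the order matters: text is normalized before deduplication so that values differing only in whitespace or case are treated as the same. A Polars or PySpark variant would keep the same step structure but swap the DataFrame calls.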
Repository Stats
- Stars: 5
- Forks: 2
- Open Issues: 0
- Language: TypeScript
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 05:23 PM