Data Analysis

data-cleaning-pipeline-generator

Generates data cleaning pipelines for pandas/polars/PySpark, handling missing values, duplicates, outliers, type conversions, and validation.

Introduction

The Data Cleaning Pipeline Generator is a specialized agent skill for data scientists, analysts, and engineers who need to automate the preprocessing of messy datasets. It generates robust Python pipelines that work with the major data processing libraries pandas, Polars, and PySpark, and moves data from a raw, noisy state to clean, analysis-ready structures by applying established data-quality practices.

  • Automatically detects and imputes missing values using strategies such as mean or median imputation, or custom fill values.

  • Provides efficient deduplication methods for multi-column subsets or full row matching.

  • Detects and removes outliers using the IQR (interquartile range) or Z-score method.

  • Performs automated data type casting, including intelligent date/time parsing and categorical encoding (label or one-hot encoding).

  • Implements text normalization features such as whitespace stripping and casing adjustments.

  • Supports validation rule definitions for numeric ranges to ensure data integrity before further analysis.

  • Users should trigger this skill when asking to clean datasets, remove duplicate entries, fix inconsistent data types, or handle null/NaN values.

  • The skill expects input in the form of tabular data (CSV, Parquet, or SQL exports) and provides modular, class-based Python code for easy integration into existing notebooks or batch scripts.

  • The generated code includes a logging utility to track the impact of each cleaning step, providing full transparency on how many rows were dropped or modified.

  • While ideal for standard pandas workflows, the logic is structured to be adaptable for larger datasets that need Polars' multithreaded engine or PySpark's distributed computing.

  • Ensure column names and data types are well-defined before invocation to allow the automatic detection features to function with maximum accuracy.
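As a rough sketch of the kind of class-based pandas pipeline the skill describes (the class name `CleaningPipeline`, method names, and the logging format here are illustrative assumptions, not the skill's actual output):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cleaning")


class CleaningPipeline:
    """Illustrative pandas cleaning pipeline with per-step row-count logging."""

    def __init__(self, numeric_bounds=None):
        # Validation rules: {column: (min_allowed, max_allowed)}
        self.numeric_bounds = numeric_bounds or {}

    def _log(self, step, before, after):
        # Track how many rows each step dropped, for transparency.
        logger.info("%s: %d -> %d rows (%d dropped)", step, before, after, before - after)

    def impute_missing(self, df, strategy="median"):
        # Fill NaNs in numeric columns with the column median (or mean).
        out = df.copy()
        for col in out.select_dtypes("number"):
            fill = out[col].median() if strategy == "median" else out[col].mean()
            out[col] = out[col].fillna(fill)
        return out

    def drop_duplicates(self, df, subset=None):
        # Deduplicate on a column subset, or on full rows when subset is None.
        before = len(df)
        out = df.drop_duplicates(subset=subset)
        self._log("drop_duplicates", before, len(out))
        return out

    def remove_outliers_iqr(self, df, col, k=1.5):
        # Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for the given column.
        before = len(df)
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
        self._log(f"outliers[{col}]", before, len(out))
        return out

    def normalize_text(self, df, cols):
        # Strip whitespace and normalize casing on text columns.
        out = df.copy()
        for col in cols:
            out[col] = out[col].str.strip().str.lower()
        return out

    def validate(self, df):
        # Raise if any value falls outside its configured numeric range.
        for col, (lo, hi) in self.numeric_bounds.items():
            bad = int((~df[col].between(lo, hi)).sum())
            if bad:
                raise ValueError(f"{bad} rows in {col!r} outside [{lo}, {hi}]")
        return df
```

A typical invocation chains the steps and finishes with validation:

```python
pipe = CleaningPipeline(numeric_bounds={"price": (0.0, 10_000.0)})
clean = (df.pipe(pipe.impute_missing)
           .pipe(pipe.drop_duplicates)
           .pipe(pipe.normalize_text, cols=["name"])
           .pipe(pipe.remove_outliers_iqr, col="price")
           .pipe(pipe.validate))
```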

Repository Stats

Stars
5
Forks
2
Open Issues
0
Language
TypeScript
Default Branch
main
Sync Status
Idle
Last Synced
May 3, 2026, 05:23 PM