polars
High-performance in-memory DataFrame library for Python and Rust. Features lazy evaluation, parallel execution, and an Apache Arrow backend for efficient ETL and data processing; a fast alternative to pandas.
Introduction
Polars is a blazingly fast, multi-threaded DataFrame library built for the Rust and Python ecosystems, designed to handle medium-to-large datasets that fit within system RAM. By utilizing Apache Arrow as its core memory format and providing an expression-based API, Polars offers a highly optimized alternative to pandas for complex data manipulation. Its architecture is specifically optimized for performance through features like memory mapping, query optimization, and cache-efficient operations, making it an essential tool for data engineers and scientists who need to process 1GB to 100GB datasets without the overhead of disk-based distributed systems.
The library enables users to define complex data transformation pipelines using either an eager execution mode or a lazy evaluation framework. The lazy API is particularly powerful, as it allows the engine to perform predicate pushdown, projection pushdown, and query plan optimization before a single line of data is actually processed. This significantly reduces computation time and memory footprint during ETL, feature engineering, and exploratory data analysis tasks.
- Expression-based API: Enables declarative, composable, and chainable transformations using syntax similar to SQL or functional programming.
- Lazy Evaluation: Automatically optimizes query plans and pushes down filters or selections to minimize redundant data reads.
- Multi-threaded Engine: Parallelizes operations by default across all available CPU cores, providing major performance gains over single-threaded libraries.
- Memory Efficiency: Built on the Arrow memory model, which supports zero-copy operations and reduced memory allocation.
- Versatile I/O: Seamlessly integrates with CSV, Parquet, JSON, Excel, and various SQL-based database connectors.
- Window Functions: Advanced support for complex analytical calculations like `over()` grouping and rolling aggregations.
- Ideal for users migrating from pandas who require better performance without switching to distributed tools like Dask or Spark.
- Best for datasets ranging from 1GB to 100GB; for datasets exceeding RAM, consider using Dask or Vaex.
- Input: Supports local files, cloud storage paths (S3/GCS/Azure), and streaming data inputs.
- Output: DataFrames, optimized query plans, or direct writes to files/databases.
- Usage Tip: Always prioritize `scan_` methods (e.g., `scan_csv`) over `read_` methods when working with large files to leverage the full benefit of lazy evaluation and query optimization.
Repository Stats
- Stars: 19,721
- Forks: 2,202
- Open Issues: 42
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: Apr 29, 2026, 02:53 PM