I Reduced My Pandas Runtime by 95% — Here’s What I Was Doing Wrong

2026-04-26 · Source: Towards Data Science · Field: Technology & Digital — Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, long

Summary

This article details common Pandas performance pitfalls and shows how to move data processing workflows from "working code" to "efficient code." It highlights that Pandas prioritizes convenience over speed, which often leads to silently inefficient operations on larger datasets. The author demonstrates that row-wise operations such as `.iterrows()` and `.apply(axis=1)` are significantly slower than vectorized, NumPy-backed operations, reporting a 14,800x speedup in one example. The piece also covers memory optimization: unnecessary data copies and the default `object`, `int64`, and `float64` data types can bloat memory, and the author advocates converting with `astype('category')` or downcasting to `int32`. Finally, it addresses the limits of Pandas, suggesting tools such as Polars, Dask, or DuckDB for datasets that run to millions of rows, and illustrates a real-world refactor that reduced runtime from 61.78 seconds to 0.33 seconds on a 1-million-row dataset.

Key takeaway

If you are a Data Scientist or Machine Learning Engineer struggling with slow Pandas notebooks on large datasets, shift your mindset from merely functional code to efficient, vectorized operations. Profile your code with `%timeit` and `df.memory_usage()` to pinpoint bottlenecks, then refactor row-wise logic into column-wise operations and optimize data types. This approach can yield dramatic performance improvements and make your data pipelines more scalable.

Key insights

Efficient Pandas code prioritizes vectorized operations and memory management over row-wise processing for significant speed gains.

Principles

Working code is not always efficient code.
Measure performance before optimizing.
Pandas optimizes for convenience, not speed.

Method

Profile Pandas code using `%timeit` and `df.memory_usage()` to identify bottlenecks. Replace row-wise operations with vectorized alternatives and optimize data types to reduce memory footprint.
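
The profiling step might look like the minimal sketch below. It is an illustration, not the original article's code: the DataFrame, its column names (`price`, `qty`), and the helper functions `row_wise` and `vectorized` are hypothetical. Because `%timeit` is an IPython/Jupyter magic, the sketch uses the standard-library `timeit` module so that it also runs as a plain script.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.random(10_000),
    "qty": rng.integers(1, 10, size=10_000),
})


def row_wise(frame):
    # Row-by-row: one Python-level function call per row.
    return frame.apply(lambda row: row["price"] * row["qty"], axis=1)


def vectorized(frame):
    # Column-wise: a single NumPy-backed multiplication.
    return frame["price"] * frame["qty"]


# timeit.timeit plays the role of %timeit outside a notebook.
t_row = timeit.timeit(lambda: row_wise(df), number=1)
t_vec = timeit.timeit(lambda: vectorized(df), number=1)
print(f"row-wise: {t_row:.4f}s  vectorized: {t_vec:.4f}s")

# deep=True also counts the contents of object (e.g. string) columns.
mem_bytes = df.memory_usage(deep=True)
print(mem_bytes)
```

Timings vary by machine and Pandas version, so treat the printed numbers as a relative comparison rather than a benchmark.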

In practice

Use `df['col'] * df['col2']` instead of `df.apply(lambda row: ..., axis=1)`.
Convert string columns to `category` type for memory savings.
Downcast integer/float types (e.g., `int64` to `int32`) when the values fit in the smaller type.
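
The three practices above could be combined roughly as follows. This is a sketch under stated assumptions rather than the article's own refactor: the data is synthetic and the column names (`city`, `units`, `price`, `revenue`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data; names and sizes are illustrative only.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["Paris", "Lima", "Oslo"], size=n),
    "units": rng.integers(0, 1_000, size=n),
    "price": rng.random(n) * 100,
})

# 1. Vectorized column arithmetic instead of df.apply(..., axis=1).
df["revenue"] = df["units"] * df["price"]

before = df.memory_usage(deep=True).sum()

# 2. A low-cardinality string column stored as a categorical.
df["city"] = df["city"].astype("category")

# 3. Downcast int64 to int32, but only after checking the value range.
int32_info = np.iinfo(np.int32)
if df["units"].min() >= int32_info.min and df["units"].max() <= int32_info.max:
    df["units"] = df["units"].astype("int32")

after = df.memory_usage(deep=True).sum()
print(f"memory before: {before:,} bytes, after: {after:,} bytes")
```

The `category` conversion pays off when a column has few distinct values relative to its length. Downcasting floats in the same way (for example to `float32`) trades precision for memory, so it is left out of this sketch.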

Topics

Pandas Performance Optimization
Vectorized Operations
Data Type Conversion
Memory Usage Analysis
Code Profiling

Best for: Data Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.