I Spent an Hour on a Data Preprocessing Task Before Asking Gemini

2026-06-23 · Source: Towards Data Science · Field: Technology & Digital — Data Science & Analytics, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

A data scientist spent approximately one hour manually performing a Pandas DataFrame post-processing task that involved extracting specific probabilities. The task required creating a new column by matching `pred_category_id` values with corresponding probabilities from `text_predicted_probs` lists, based on the order found in `predicted_categories` lists. This involved converting string representations of lists to actual list objects using `ast.literal_eval`, extracting category IDs from strings, and then using list comprehensions to find indices and retrieve probabilities. After completing the manual solution, the author prompted Gemini, which generated a correct solution in seconds. Gemini's initial solution used the non-vectorized `apply` function, but a subsequent prompt yielded a more efficient, vectorized Pandas solution utilizing `explode`, `str.extract`, and filtering.
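The manual approach described above might look roughly like the following sketch. The column names come from the summary, but the sample values, the `category_<id>` label format, and the output column name `pred_prob` are assumptions made for illustration; the original article's data and code may differ.

```python
import ast
import re

import pandas as pd

# Hypothetical sample data: column names are from the summary, while the
# "category_<id>" label format and all values are assumed for illustration.
df = pd.DataFrame({
    "pred_category_id": [12, 7],
    "predicted_categories": ["['category_3', 'category_12']",
                             "['category_7', 'category_1']"],
    "text_predicted_probs": ["[0.2, 0.8]", "[0.6, 0.4]"],
})

# Step 1: turn the string representations into real Python lists.
categories = [ast.literal_eval(s) for s in df["predicted_categories"]]
probs = [ast.literal_eval(s) for s in df["text_predicted_probs"]]

# Step 2: pull the numeric ID out of each category label.
ids = [[int(re.search(r"\d+", label).group()) for label in row]
       for row in categories]

# Step 3: find where the predicted ID sits in each row's list.
# list.index raises ValueError if the ID is missing from that row.
positions = [row_ids.index(target)
             for row_ids, target in zip(ids, df["pred_category_id"])]

# Step 4: read the probability at that position into a new column.
df["pred_prob"] = [row_probs[pos] for row_probs, pos in zip(probs, positions)]
```

Each step is simple on its own, but chaining them by hand is the sort of work that can plausibly take an hour of trial and error.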

Key takeaway

If you handle complex data preprocessing, integrating an LLM such as Gemini into your workflow can sharply reduce development time for tasks like nested data extraction. LLMs generate code quickly, but review their first output for performance bottlenecks, in particular row-wise `apply` calls where vectorized operations would scale better on large datasets. Your domain expertise remains vital for spotting these issues and prompting for an optimized solution.

Key insights

LLMs significantly accelerate data preprocessing tasks, but domain knowledge is crucial for evaluating and optimizing generated code for efficiency.

Principles

Data preparation is time-intensive.
LLMs boost coding productivity.
Vectorized Pandas operations generally outperform row-wise `apply` on large DataFrames.

Method

To extract nested data, convert string-lists to lists, parse IDs, then use list comprehensions or vectorized Pandas operations like `explode` and `str.extract` to match and retrieve values.
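A vectorized variant of this method could be sketched as below. It is not the code Gemini produced, only one plausible arrangement of `explode`, `str.extract`, and filtering. The sample data, the `category_<id>` label format, and the helper names `row`, `category_id`, and `pred_prob` are assumptions, and the sketch presumes each category appears at most once per row and that pandas 1.3 or later is available (for multi-column `explode`).

```python
import ast

import pandas as pd

# Hypothetical sample data: column names are from the summary, values and
# label format are assumed for illustration.
df = pd.DataFrame({
    "pred_category_id": [12, 7],
    "predicted_categories": ["['category_3', 'category_12']",
                             "['category_7', 'category_1']"],
    "text_predicted_probs": ["[0.2, 0.8]", "[0.6, 0.4]"],
})

# Parse the string-lists. This step is still elementwise; the matching
# logic that follows is what gets vectorized.
parsed = df.assign(
    predicted_categories=df["predicted_categories"].map(ast.literal_eval),
    text_predicted_probs=df["text_predicted_probs"].map(ast.literal_eval),
)

# One row per (category, probability) pair, remembering the source row.
long = (
    parsed.explode(["predicted_categories", "text_predicted_probs"])
    .rename_axis("row")
    .reset_index()
)

# Extract the numeric ID from each label in a single vectorized call.
long["category_id"] = (
    long["predicted_categories"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Keep only the pairs whose ID equals the predicted category ID.
matched = long[long["category_id"] == long["pred_category_id"]]

# Align the matched probabilities back to the original rows by index.
df["pred_prob"] = matched.set_index("row")["text_predicted_probs"].astype(float)
```

Rows with no matching category would end up as `NaN` in `pred_prob`, since the assignment aligns on the original index.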

In practice

Use `ast.literal_eval` for string-lists.
Prefer vectorized Pandas over `apply`.
Prompt LLMs for code optimization.
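As a small illustration of the first tip, `ast.literal_eval` parses only Python literals (lists, numbers, strings, and similar) and rejects anything else, which is why it is generally preferred over `eval` for string-lists. The `safe_parse` helper below is a hypothetical name introduced for this sketch, not something from the article.

```python
import ast

# A string that looks like a list becomes a real list.
probs = ast.literal_eval("[0.2, 0.8]")


def safe_parse(text):
    """Return the parsed literal, or None if the text is not a plain literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


parsed_ok = safe_parse("['category_3', 'category_12']")

# A function call is not a literal, so it is rejected rather than executed.
parsed_bad = safe_parse("__import__('os').getcwd()")
```

Returning `None` for unparseable cells is one design choice among several; raising the error instead may be preferable when malformed rows should stop the pipeline.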

Topics

Data Preprocessing
Pandas DataFrame
Large Language Models
Gemini
Vectorized Operations
Python Programming

Best for: Data Scientist, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.