I Spent an Hour on a Data Preprocessing Task Before Asking Gemini
Summary
A data scientist spent approximately one hour manually performing a Pandas DataFrame post-processing task that involved extracting specific probabilities. The task required creating a new column by matching `pred_category_id` values with corresponding probabilities from `text_predicted_probs` lists, based on the order found in `predicted_categories` lists. This involved converting string representations of lists to actual list objects using `ast.literal_eval`, extracting category IDs from strings, and then using list comprehensions to find indices and retrieve probabilities. After completing the manual solution, the author prompted Gemini, which generated a correct solution in seconds. Gemini's initial solution used the non-vectorized `apply` function, but a subsequent prompt yielded a more efficient, vectorized Pandas solution utilizing `explode`, `str.extract`, and filtering.
Key takeaway
If you are a data scientist handling complex data preprocessing, integrating LLMs like Gemini into your workflow can drastically reduce development time for tasks like nested data extraction. LLMs generate code quickly, but review their initial output for performance bottlenecks, especially row-wise `apply` calls where vectorized operations would scale better on large datasets. Your domain expertise remains vital for spotting these issues and prompting for optimized solutions.
Key insights
LLMs significantly accelerate data preprocessing tasks, but domain knowledge is crucial for evaluating and optimizing generated code for efficiency.
Principles
- Data preparation is time-intensive.
- LLMs boost coding productivity.
- Vectorized Pandas operations usually outperform row-wise `apply` on large datasets.
Method
To extract nested data, convert string-lists to lists, parse IDs, then use list comprehensions or vectorized Pandas operations like `explode` and `str.extract` to match and retrieve values.
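The row-wise version of this method can be sketched as follows. This is a minimal illustration, not the author's code: the sample data, the label format (a numeric ID at the start of each category string), the helper names `extract_id` and `matching_prob`, and the output column `pred_prob` are all assumptions made for the example. Only the three input column names come from the article.

```python
import ast
import re

import pandas as pd

# Hypothetical sample data: the list columns are stored as strings,
# as they would be after a CSV round trip.
df = pd.DataFrame({
    "pred_category_id": [12, 7],
    "predicted_categories": [
        "['12 - Sports', '7 - Politics', '3 - Tech']",
        "['3 - Tech', '7 - Politics']",
    ],
    "text_predicted_probs": ["[0.7, 0.2, 0.1]", "[0.6, 0.4]"],
})


def extract_id(label):
    """Return the leading integer of a label (assumed format), or None."""
    match = re.match(r"\s*(\d+)", label)
    return int(match.group(1)) if match else None


def matching_prob(row):
    """Find the probability whose category ID equals pred_category_id."""
    # Convert the string representations into real Python lists.
    categories = ast.literal_eval(row["predicted_categories"])
    probs = ast.literal_eval(row["text_predicted_probs"])
    ids = [extract_id(c) for c in categories]
    # Keep probabilities at positions where the category ID matches.
    hits = [p for i, p in zip(ids, probs) if i == row["pred_category_id"]]
    return hits[0] if hits else None


# Row-wise apply: simple to read, but it runs a Python function per row.
df["pred_prob"] = df.apply(matching_prob, axis=1)
print(df[["pred_category_id", "pred_prob"]])
```

This form is easy to verify by eye, which makes it a useful reference result when checking a faster rewrite.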
In practice
- Use `ast.literal_eval` for string-lists.
- Prefer vectorized Pandas over `apply`.
- Prompt LLMs for code optimization.
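The points above could combine into something like the sketch below, which uses `explode`, `str.extract`, and filtering in place of a row-wise `apply`. It is not the code Gemini produced. The sample data, the label format (a numeric ID at the start of each category string), and the output column `pred_prob` are assumptions. Multi-column `explode` requires pandas 1.3 or later, and the string parsing with `ast.literal_eval` is still a per-element Python step.

```python
import ast

import pandas as pd

# Hypothetical sample data with list columns stored as strings.
df = pd.DataFrame({
    "pred_category_id": [12, 7],
    "predicted_categories": [
        "['12 - Sports', '7 - Politics', '3 - Tech']",
        "['3 - Tech', '7 - Politics']",
    ],
    "text_predicted_probs": ["[0.7, 0.2, 0.1]", "[0.6, 0.4]"],
})

# Build a long table: one row per (category, probability) pair.
# Both lists in a row must have the same length for explode to work.
long = pd.DataFrame({
    "pred_category_id": df["pred_category_id"],
    "category": df["predicted_categories"].map(ast.literal_eval),
    "prob": df["text_predicted_probs"].map(ast.literal_eval),
}).explode(["category", "prob"])

# Pull the numeric ID out of each label (assumed to lead the string).
long["category_id"] = (
    long["category"].str.extract(r"^\s*(\d+)", expand=False).astype(int)
)

# Keep only the pairs whose ID matches the predicted category.
matched = long[long["category_id"] == long["pred_category_id"]]
# Guard against a category appearing twice in one row.
matched = matched[~matched.index.duplicated(keep="first")]

# The exploded rows keep the original index, so assignment aligns by row.
df["pred_prob"] = matched["prob"].astype(float)
print(df[["pred_category_id", "pred_prob"]])
```

Rows with no matching category end up with `NaN` in `pred_prob`, because index alignment leaves them unfilled.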
Topics
- Data Preprocessing
- Pandas DataFrame
- Large Language Models
- Gemini
- Vectorized Operations
- Python Programming
Best for: Data Scientist, Machine Learning Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.