Data Preprocessing in Machine Learning: Working with Numerical & Categorical Data

2026-05-31 · Source: Data Science on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Novice, medium

Summary

This guide introduces essential data preprocessing techniques for machine learning and explains why this preparation has to happen before model training. For numerical data, it covers filling missing values with the mean or median, feature scaling through normalization (rescaling to the 0-1 range) or standardization (rescaling to zero mean and unit standard deviation), and detecting outliers and treating them by removal, capping, or transformation. For categorical data, it explains label encoding (mapping categories to numbers), one-hot encoding (creating one binary column per category), and ordinal encoding (mapping categories to numbers that preserve their order), along with strategies for rare or unknown categories. The guide also covers general data cleaning, feature transformation, real-world applications in healthcare and banking, and common preprocessing mistakes, and it shows how proper preparation improves model accuracy and efficiency.
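
The scaling and capping techniques named in the summary can be illustrated with a minimal sketch in plain Python. The function names, the sample `ages` list, and the choice of the 1.5 x IQR rule for locating outliers are assumptions made for illustration; the source names capping as a treatment but does not specify a detection rule.

```python
from statistics import mean, pstdev, quantiles


def min_max_normalize(values):
    """Rescale values linearly into the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound.

    The IQR rule and k=1.5 are illustrative assumptions, not the source's method.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample column with one extreme value.
ages = [22, 25, 27, 30, 31, 95]
normalized = min_max_normalize(ages)
standardized = standardize(ages)
capped = cap_outliers_iqr(ages)
```

In a real project the scaling parameters (minimum, maximum, mean, standard deviation) would normally be computed on the training split only and then reused on validation and test data.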

Key takeaway

Data scientists preparing datasets for model training should treat preprocessing as a required step, not an optional one. Handle missing values, scale numerical features appropriately, and encode categorical data with a method that fits the category type, such as one-hot encoding for unordered categories or ordinal encoding for ordered ones. Skipping these steps can lead to unreliable predictions and inefficient model learning. Always check for outliers and duplicates so that the model learns from accurate and balanced data.

Key insights

Data preprocessing underpins model accuracy: it gives the model clean, consistently transformed data to learn from.

Principles

Clean data improves model accuracy.
Many models learn better from scaled numerical data.
Categorical data needs numerical conversion.
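
As a rough illustration of the last principle, the sketch below converts categories to numbers in two ways: label encoding, where the integer codes are arbitrary, and ordinal encoding, where the codes follow an explicit order. The helper names, the sample values, and the use of `-1` for unknown categories are assumptions made for this example.

```python
def fit_label_encoder(categories):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


def ordinal_encode(values, ordered_levels, unknown=-1):
    """Encode values by their position in an explicit low-to-high ordering.

    Values not listed in ordered_levels get the `unknown` code
    (an assumed convention for this sketch).
    """
    rank = {level: idx for idx, level in enumerate(ordered_levels)}
    return [rank.get(v, unknown) for v in values]


# Label encoding: the codes carry no meaningful order.
colors = ["red", "green", "blue", "green"]
color_codes = fit_label_encoder(colors)
encoded_colors = [color_codes[c] for c in colors]

# Ordinal encoding: small < medium < large is preserved; "xl" is unseen.
sizes = ["small", "large", "medium", "xl"]
encoded_sizes = ordinal_encode(sizes, ["small", "medium", "large"])
```

Because label codes imply an order that may not exist, they suit ordered categories or models that do not interpret the codes as magnitudes; for unordered categories, one-hot encoding avoids that implied ranking.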

Method

The article describes a general workflow: identify data types (numerical/categorical), handle missing values, scale numerical features, encode categorical features, detect/treat outliers, and clean for duplicates/errors.
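
The bookkeeping steps of this workflow (identifying column types, counting missing values, and removing duplicates) can be sketched as follows. This is a minimal plain-Python illustration on a hypothetical list of records where `None` marks a missing value; the function names and sample data are assumptions, and a dataframe library would typically handle these steps in practice.

```python
def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent, hashable key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_columns_by_type(rows):
    """Classify each column as numerical or categorical from its non-missing values."""
    numerical, categorical = [], []
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        is_number = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        (numerical if present and is_number else categorical).append(col)
    return numerical, categorical


def count_missing(rows):
    """Count missing (None) entries per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


# Hypothetical records; the third row duplicates the first.
records = [
    {"age": 34, "income": 52000.0, "city": "Pune"},
    {"age": None, "income": 61000.0, "city": "Delhi"},
    {"age": 34, "income": 52000.0, "city": "Pune"},
    {"age": 45, "income": None, "city": None},
]

clean = drop_duplicates(records)
numerical_cols, categorical_cols = split_columns_by_type(clean)
missing = count_missing(clean)
```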

In practice

Fill missing numerical values with mean/median.
Apply one-hot encoding for nominal categories.
Group rare categories to simplify datasets.
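
The three practices above can be sketched in a few lines of plain Python. The helper names, the `min_count=2` threshold, the `"Other"` bucket label, and the sample data are assumptions made for illustration only.

```python
from collections import Counter
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values):
    """Return the sorted category names and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical numerical column with two missing entries.
incomes = [40000, None, 52000, 61000, None]
filled = fill_missing_with_median(incomes)

# Hypothetical nominal column; "Leh" appears only once, so it is grouped.
cities = ["Pune", "Delhi", "Pune", "Delhi", "Leh"]
grouped = group_rare(cities)
columns, matrix = one_hot(grouped)
```

Grouping rare categories before one-hot encoding keeps the number of binary columns small, which is the simplification the last practice refers to. The median is used here because it is less sensitive to extreme values than the mean.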

Topics

Data Preprocessing
Numerical Data Handling
Categorical Data Encoding
Feature Scaling
Outlier Detection
Machine Learning Workflows

Best for: AI Student, Data Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.