Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition

2026-03-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A systematic study investigated sub-word tokenization and concatenation-based data augmentation to address challenges in Inertial Measurement Unit (IMU)-based online handwriting recognition (OnHWR), specifically uneven character distributions and inter-writer variability. Experiments on the OnHW-Words500 dataset revealed that Bigram tokenization significantly improved performance on the writer-independent (WI) split, reducing the word error rate (WER) from 15.40% to 12.99% for unseen writing styles. Conversely, tokenization degraded performance on the writer-dependent (WD) split due to vocabulary distribution shifts. For the WD split, a proposed concatenation-based data augmentation method acted as a powerful regularizer, reducing the character error rate (CER) by 34.5% and WER by 25.4%. The study concluded that sub-word tokenization mitigates inter-writer stylistic variability, while concatenation-based data augmentation compensates for intra-writer distributional sparsity.
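The error rates quoted above can be made concrete with a small sketch. It assumes the common edit-distance definitions of CER and WER, which the summary itself does not spell out, and all function names are illustrative. The helper relative_reduction shows how a drop such as 15.40% to 12.99% translates into a relative figure; the summary does not say whether the 34.5% and 25.4% reductions are relative or absolute, so that is left open here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate in percent: character edits / reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * edits / chars


def wer(refs, hyps):
    """Word error rate in percent: word edits / reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * edits / words


def relative_reduction(before, after):
    """Relative improvement in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # WI split, Bigram tokenization: 15.40% -> 12.99% WER
    print(f"{relative_reduction(15.40, 12.99):.2f}% relative WER reduction")
```

Under these definitions, the writer-independent improvement from 15.40% to 12.99% WER corresponds to a relative reduction of roughly 15.6%.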

Key takeaway

If you are developing an IMU-based OnHWR system, match your generalization strategy to the kind of variance you face. When the system must serve diverse users, use Bigram tokenization to abstract away stylistic variation between writers. When the problem is data sparsity within a single user's profile, prioritize concatenation-based data augmentation to regularize training and balance character distributions; the study found this significantly outperforms simply training for longer.

Key insights

Optimal OnHWR strategies depend on whether the system addresses inter-writer style variability or intra-writer data sparsity.

Principles

Bigram tokenization excels for inter-writer variability.
Concatenation-based augmentation suits intra-writer sparsity.
Short, low-level tokens benefit model performance.
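
To illustrate what a short, low-level token vocabulary might look like, here is a minimal sketch of character-bigram tokenization. The summary does not describe the paper's exact scheme, so everything here is an assumption: the greedy left-to-right segmentation, the fallback to single characters, the max_bigrams cutoff, and the function names are all hypothetical.

```python
from collections import Counter


def build_bigram_vocab(words, max_bigrams=100):
    """Map single characters plus the most frequent character pairs to class ids.

    Assumption: the class set is all seen characters plus frequent bigrams.
    """
    chars = sorted({c for w in words for c in w})
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    bigrams = sorted(b for b, _ in counts.most_common(max_bigrams))
    return {tok: idx for idx, tok in enumerate(chars + bigrams)}


def tokenize(word, vocab):
    """Greedily split a word into non-overlapping tokens, preferring bigrams."""
    tokens, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in vocab:
            tokens.append(pair)
            i += 2
        else:
            # Fall back to one character (odd-length tail or unseen pair).
            tokens.append(word[i])
            i += 1
    return tokens


if __name__ == "__main__":
    vocab = build_bigram_vocab(["hello", "help", "shell"])
    print(tokenize("hello", vocab))
```

Because the tokens never overlap, joining them always reproduces the original word, which keeps decoding from class ids back to text trivial.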

Method

The study used the REWI CNN-LSTM architecture, modifying text-to-class mapping with Bigram, BPE, and Unigram tokenization, and enhancing data preprocessing with concatenation-based data augmentation.
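
The augmentation idea can be sketched as follows. This is not the paper's implementation: the summary gives no details, so the random pairing, the plain joining of labels without a separator, the absence of any smoothing at the junction, and the names concat_augment, n_new, and max_parts are all assumptions made for illustration. Each sample is taken to be a pair of an IMU signal (a list of per-timestep sensor frames) and its text label.

```python
import random


def concat_augment(samples, n_new, max_parts=2, seed=0):
    """Create n_new synthetic samples by joining recordings along the time axis.

    samples: list of (signal, label) pairs, where signal is a list of frames
             and each frame is a list of sensor channel values.
    Assumption: parts are drawn uniformly at random and labels are joined as-is.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        k = rng.randint(2, max_parts)  # how many recordings to join
        parts = [rng.choice(samples) for _ in range(k)]
        signal = [frame for sig, _ in parts for frame in sig]
        label = "".join(lab for _, lab in parts)
        augmented.append((signal, label))
    return augmented


if __name__ == "__main__":
    # Two toy recordings with 6 hypothetical sensor channels each.
    data = [([[0.0] * 6] * 3, "ab"), ([[1.0] * 6] * 5, "cde")]
    for signal, label in concat_augment(data, n_new=3):
        print(label, len(signal))
```

A balancing variant could weight rng.choice toward samples containing rare characters; whether the study does this is not stated in the summary.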

In practice

Use Bigram tokenization for diverse user bases.
Apply concatenation-based augmentation for imbalanced datasets.
Prioritize short tokens in OnHWR models.

Topics

Online Handwriting Recognition
IMU-based Recognition
Sub-word Tokenization
Data Augmentation
Writer Variability

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.