Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
Summary
A systematic study investigated sub-word tokenization and concatenation-based data augmentation to address challenges in Inertial Measurement Unit (IMU)-based online handwriting recognition (OnHWR), specifically uneven character distributions and inter-writer variability. Experiments on the OnHW-Words500 dataset revealed that Bigram tokenization significantly improved performance on the writer-independent (WI) split, reducing the word error rate (WER) from 15.40% to 12.99% for unseen writing styles. Conversely, tokenization degraded performance on the writer-dependent (WD) split due to vocabulary distribution shifts. For the WD split, a proposed concatenation-based data augmentation method acted as a powerful regularizer, reducing the character error rate (CER) by 34.5% and WER by 25.4%. The study concluded that sub-word tokenization mitigates inter-writer stylistic variability, while concatenation-based data augmentation compensates for intra-writer distributional sparsity.
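The error rates quoted above can be made concrete with a small sketch. It assumes the common definitions: CER as character-level edit distance divided by the number of reference characters, and WER on single-word samples as the fraction of words not recognized exactly. The paper's exact metric definitions are not given in this summary, and all function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: total edit distance over total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word error rate for single-word samples (assumed definition)."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)


def relative_reduction(before, after):
    """Relative improvement of an error rate, as a fraction of the baseline."""
    return (before - after) / before
```

Under these definitions, the reported WI change from 15.40% to 12.99% WER is 2.41 percentage points in absolute terms, or about 15.6% relative to the baseline (`relative_reduction(15.40, 12.99)`).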
Key takeaway
When developing IMU-based OnHWR systems, match the generalization strategy to the kind of variance you face. For a system that must serve diverse users, use Bigram tokenization to abstract away stylistic variation between writers. To address data sparsity within a single writer's data, prioritize concatenation-based data augmentation, which regularizes training and balances character distributions; according to the study, it significantly outperforms simply training for longer.
Key insights
Optimal OnHWR strategies depend on whether the system addresses inter-writer style variability or intra-writer data sparsity.
Principles
- Bigram tokenization excels for inter-writer variability.
- Concatenation-based augmentation suits intra-writer sparsity.
- Short, low-level tokens (such as bigrams) tend to benefit model performance.
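As a rough illustration of the first principle, the sketch below maps a word to bigram tokens and then to class indices. It is one plausible reading, not the paper's confirmed scheme: the non-overlapping split, the single-character tail for odd-length words, and the reserved index 0 are all assumptions, and the function names are hypothetical.

```python
def bigram_tokenize(word):
    """Split a word into non-overlapping two-character tokens.

    Assumption: an odd-length word ends with a one-character token.
    """
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def build_vocab(words):
    """Map every token seen in `words` to a class index.

    Assumption: index 0 is left free for a blank or padding symbol.
    """
    tokens = sorted({tok for w in words for tok in bigram_tokenize(w)})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


def encode(word, vocab):
    """Convert a word into the class-index sequence a model would predict."""
    return [vocab[tok] for tok in bigram_tokenize(word)]
```

For example, `bigram_tokenize("hello")` yields `["he", "ll", "o"]`, so the target sequence is roughly half as long as a character-level one.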
Method
The study built on the REWI CNN-LSTM architecture, modifying the text-to-class mapping with Bigram, byte-pair encoding (BPE), and Unigram tokenization, and extending the data preprocessing with concatenation-based data augmentation.
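A minimal sketch of what concatenation-based augmentation could look like is given below, assuming each sample is a pair of an IMU signal (a list of per-timestep channel readings) and a text label. The summary does not say how the paper selects samples, handles the transition between recordings, or joins labels, so the uniform random pairing and the `sep` parameter are assumptions, and `concat_augment` is a hypothetical name.

```python
import random


def concat_augment(samples, n_new, rng=None, sep=""):
    """Create `n_new` synthetic samples by joining two recordings in time.

    samples: list of (signal, label) pairs, where signal is a list of
             per-timestep channel readings and label is a string.
    sep:     string placed between the two labels (assumption: none).
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    augmented = []
    for _ in range(n_new):
        # Assumption: two distinct samples are drawn uniformly at random.
        (sig_a, lab_a), (sig_b, lab_b) = rng.sample(samples, 2)
        # Concatenate along the time axis and join the labels in order.
        augmented.append((sig_a + sig_b, lab_a + sep + lab_b))
    return augmented
```

A sampling scheme weighted toward rare characters, rather than the uniform one used here, would be a natural way to target the balancing effect the study describes.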
In practice
- Use Bigram tokenization for diverse user bases.
- Apply concatenation-based augmentation for imbalanced datasets.
- Prioritize short tokens in OnHWR models.
Topics
- Online Handwriting Recognition
- IMU-based Recognition
- Sub-word Tokenization
- Data Augmentation
- Writer Variability
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.