Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods
Summary
WordArt-Oriented Scene Text Recognition (WATER) targets highly customized artistic text, which existing Scene Text Recognition (STR) methods handle poorly because they are built around regular text and fixed-template inputs. The researchers constructed WATER-S, a 2M-sample synthetic dataset that is hundreds of times larger than what was previously available. It combines data rendered by an upgraded SynthWordArt pipeline with data generated using Qwen3-VL for prompt mining and Z-Image for image synthesis. They also proposed WATERec, a model with a visual encoder that accepts arbitrary-shaped inputs and an autoregressive decoder that handles complex layouts. Trained on WATER-S together with WATER-R (reorganized real STR data), WATERec reached 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models.
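To make the headline number concrete, here is a minimal sketch of how a recognition accuracy figure of this kind is commonly computed in STR: the share of images whose predicted string exactly matches the label. The summary does not state the benchmark's exact protocol, so the case-insensitive matching and whitespace stripping below are assumptions, and `word_accuracy` is an illustrative name rather than anything from the paper.

```python
def word_accuracy(predictions, references, case_sensitive=False):
    """Fraction of samples whose predicted string exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")

    def normalize(text):
        # Assumption: surrounding whitespace is ignored and case is optional.
        text = text.strip()
        return text if case_sensitive else text.lower()

    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


preds = ["Coffee", "OPEN", "bakery", "sa1e"]
refs = ["coffee", "open", "Bakery", "sale"]
print(f"{word_accuracy(preds, refs):.2%}")  # 75.00%
```

Under this definition a single wrong character ("sa1e" vs "sale") makes the whole sample count as an error, which is why stylized glyphs are so costly for general-purpose recognizers.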
Key takeaway
If you are building OCR for highly stylized or artistic text, general STR methods are not enough. Consider specialized training data such as WATER-S and models such as WATERec, which support arbitrary-shaped inputs and complex layouts. In the reported results, WATERec trained on WATER-S and WATER-R reached 90.40% accuracy on WordArt-Bench, ahead of general-purpose and OCR-specialized vision-language models.
Key insights
Recognizing highly customized WordArt demands specialized datasets and models beyond general Scene Text Recognition.
Principles
- Customized fonts and layouts increase text recognition complexity.
- Synthetic data generation can significantly expand dataset scale.
- Arbitrary-shaped input encoders improve artistic text recognition.
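The last principle can be illustrated with a little arithmetic. The sketch below compares two ways of preparing an image for a patch-based encoder: forcing it into a fixed template versus tiling it at its own shape. The summary does not describe how WATERec's encoder works internally, so this is only one plausible reading; the 32x128 template, the patch size of 16, and both function names are assumptions chosen for illustration.

```python
import math


def template_stretch(height, width, template=(32, 128)):
    """Horizontal-vs-vertical stretch caused by resizing to a fixed template.

    A value of 1.0 means the aspect ratio is preserved; larger values mean
    the glyphs are widened relative to their height.
    """
    t_h, t_w = template
    return (t_w / width) / (t_h / height)


def patch_grid(height, width, patch=16):
    """Patch grid for an image kept at its own shape.

    Returns (rows, cols, padded_size), where padded_size is the smallest
    (height, width) that is a multiple of the patch size on both sides.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows, cols, (rows * patch, cols * patch)


# A tall, narrow piece of WordArt: 256 pixels high, 64 pixels wide.
print(template_stretch(256, 64))  # 16.0
print(patch_grid(256, 64))        # (16, 4, (256, 64))
```

The fixed template widens this vertical sample sixteenfold, while the shape-preserving grid simply yields a 16 by 4 token layout, so the encoder sees undistorted glyphs at the cost of a variable sequence length.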
Method
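Construct WATER-S from two sources: data rendered by the upgraded SynthWordArt pipeline, and data generated with Qwen3-VL (prompt mining) and Z-Image (image synthesis). Develop WATERec with a visual encoder for arbitrary-shaped inputs and an autoregressive decoder for complex layouts.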
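For readers unfamiliar with the term, autoregressive decoding means the output is produced one token at a time, with each step conditioned on the tokens already emitted. The toy loop below shows only that control flow. It is not WATERec's decoder: `toy_scorer` is a hypothetical stand-in for a trained network, and the `memory` argument stands in for encoder features.

```python
BOS, EOS = "<s>", "</s>"


def greedy_decode(score_next, memory, max_len=32):
    """Emit one token per step, feeding the growing prefix back into the scorer."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = score_next(memory, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return "".join(tokens[1:])


def toy_scorer(memory, prefix):
    """Hypothetical stand-in for a trained decoder.

    It treats `memory` as the characters in reading order and gives the
    next unread character the top score, so the loop can be run end to end.
    """
    position = len(prefix) - 1  # characters emitted so far (prefix includes BOS)
    target = memory[position] if position < len(memory) else EOS
    vocab = set(memory) | {EOS}
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}


print(greedy_decode(toy_scorer, list("WATER")))  # WATER
```

Because every step sees the full prefix, a real decoder of this kind can learn the reading order from data instead of assuming a left-to-right line, which is the property that matters for curved or stacked layouts.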
In practice
- Utilize Qwen3-VL for effective prompt mining in data synthesis.
- Employ Z-Image for generating diverse and realistic images.
- Integrate arbitrary-shaped input encoders for artistic text tasks.
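The first two practices above can be sketched as a small pipeline. This summary does not detail how prompts are mined or how images are generated, so everything here is an assumption: the sketch supposes that the vision-language model yields style descriptions, which are then combined with target words into text-to-image prompts. `fake_style_miner` and `fake_generator` are hypothetical placeholders for Qwen3-VL and Z-Image and do not reflect either model's actual API.

```python
import itertools


def build_prompts(words, styles, template='WordArt of the text "{word}", {style}'):
    """Pair every target word with every mined style description."""
    return [
        {"label": word, "prompt": template.format(word=word, style=style)}
        for word, style in itertools.product(words, styles)
    ]


def synthesize(records, generate_image):
    """Attach a generated image to each record; the label stays the ground truth."""
    return [dict(rec, image=generate_image(rec["prompt"])) for rec in records]


# Hypothetical stand-ins so the sketch runs without any model.
def fake_style_miner(reference_images):
    return [f"in the style of {name}" for name in reference_images]


def fake_generator(prompt):
    return f"<image for: {prompt}>"


styles = fake_style_miner(["neon_sign.jpg", "chalk_menu.jpg"])
records = synthesize(build_prompts(["OPEN", "Cafe"], styles), fake_generator)
print(len(records), records[0]["label"])  # 4 OPEN
```

Keeping the word as the label while only the style varies is what lets such a pipeline scale: the ground truth is known by construction, so no manual annotation is needed.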
Topics
- WordArt Recognition
- Scene Text Recognition
- Synthetic Data Generation
- Qwen3-VL
- Z-Image
- Autoregressive Decoders
- WordArt-Bench
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.