Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

2026-06-23 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

The WATER project (Advancing WordArt-Oriented Scene Text Recognition) addresses the recognition of highly customized artistic text, which existing Scene Text Recognition (STR) methods handle poorly because they focus on regular text and fixed-template inputs. The researchers constructed WATER-S, a synthetic dataset of 2M samples, which they report as a hundreds-fold increase in scale. WATER-S draws on two sources: data rendered by an upgraded SynthWordArt pipeline, and data generated with Qwen3-VL for prompt mining and Z-Image for image synthesis. They also proposed WATERec, a model with a visual encoder for arbitrary-shaped inputs and an autoregressive decoder for complex layouts. Combined with WATER-R (reorganized real STR data), the approach reached 90.40% accuracy on WordArt-Bench, significantly surpassing both general-purpose and OCR-specialized vision-language models.

Key takeaway

For machine learning engineers building OCR for highly stylized or artistic text, general STR methods are insufficient. Consider specialized datasets such as WATER-S and models such as WATERec, which supports arbitrary-shaped inputs and complex layouts. In the reported results, this combination reached 90.40% accuracy on WordArt-Bench, ahead of general-purpose vision-language models.
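To compare recognizers on your own stylized-text data, you need a scoring function. The sketch below assumes exact-match word accuracy, a common STR convention; the normalization WordArt-Bench actually applies (case handling, punctuation) is not stated in this summary, so `word_accuracy` and its `case_sensitive` flag are illustrative names, not the benchmark's protocol.

```python
def word_accuracy(predictions, references, case_sensitive=False):
    """Fraction of samples whose predicted string exactly matches its label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        # Assumption: surrounding whitespace never counts; case is optional.
        text = text.strip()
        return text if case_sensitive else text.lower()

    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


preds = ["Coffee", "OPEN", "bakery", "sa1e"]
labels = ["coffee", "open", "Bakery", "sale"]
print(f"{word_accuracy(preds, labels):.2%}")  # 75.00% (3 of 4 match)
```

Exact match is strict: a single wrong character, as in "sa1e", makes the whole word count as an error.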

Key insights

Recognizing highly customized WordArt demands specialized datasets and models beyond general Scene Text Recognition.

Principles

Customized fonts and layouts increase text recognition complexity.
Synthetic data generation can significantly expand dataset scale.
Arbitrary-shaped input encoders improve artistic text recognition.
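One generic way to accept arbitrary-shaped inputs is to keep the crop's aspect ratio and let the number of patch tokens vary, instead of squashing every image into a fixed template. The sketch below illustrates that idea only. It is not WATERec's encoder, and `patch_grid`, the patch size of 16, and the budget of 1024 tokens are all assumptions made for illustration.

```python
import math


def patch_grid(height, width, patch=16, max_tokens=1024):
    """Return (rows, cols) of a patch grid that follows the image shape.

    The grid keeps the aspect ratio where possible and is scaled down
    only when rows * cols would exceed the token budget.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    if rows * cols > max_tokens:
        # Shrink both sides by the same factor to preserve aspect ratio.
        scale = math.sqrt(max_tokens / (rows * cols))
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
        # Extreme aspect ratios can still overshoot once one side is
        # clamped to 1, so cap the longer side explicitly.
        if rows * cols > max_tokens:
            if rows >= cols:
                rows = max_tokens // cols
            else:
                cols = max_tokens // rows
    return rows, cols


print(patch_grid(32, 512))     # wide banner     -> (2, 32)
print(patch_grid(512, 32))     # vertical sign   -> (32, 2)
print(patch_grid(1024, 1024))  # large square    -> (32, 32)
```

A wide banner and a vertical sign produce differently shaped grids here, whereas a fixed-template pipeline would force both into the same shape and distort one of them.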

Method

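Construct WATER-S from two synthetic sources: images rendered by an upgraded SynthWordArt pipeline, and images generated with Qwen3-VL for prompt mining and Z-Image for image synthesis. Develop WATERec with a visual encoder that accepts arbitrary-shaped inputs and an autoregressive decoder that handles complex layouts. Combine the synthetic data with WATER-R (reorganized real STR data).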
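An autoregressive decoder emits one character at a time and conditions each step on what it has already produced, which is what lets it follow reading orders that a fixed left-to-right template cannot. The sketch below shows plain greedy decoding under stated assumptions: `greedy_decode`, `toy_step`, and the `<bos>`/`<eos>` markers are hypothetical names, and `toy_step` is a stand-in for a real model that would also condition on the encoder's visual features.

```python
BOS, EOS = "<bos>", "<eos>"


def greedy_decode(step_fn, max_len=32):
    """Generate characters one at a time, feeding each choice back in."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = step_fn(tokens)  # dict: candidate token -> score
        next_token = max(scores, key=scores.get)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return "".join(tokens[1:])  # drop the start marker


def toy_step(tokens, target="WATER"):
    """Stand-in decoder step: always favours the next character of `target`."""
    position = len(tokens) - 1  # characters emitted so far
    best = target[position] if position < len(target) else EOS
    scores = {candidate: 0.0 for candidate in set(target) | {EOS}}
    scores[best] = 1.0
    return scores


print(greedy_decode(toy_step))  # WATER
```

Greedy selection is the simplest decoding strategy; whether WATERec uses greedy decoding, beam search, or something else is not stated in this summary.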

In practice

Use Qwen3-VL for prompt mining during data synthesis.
Use Z-Image to synthesize diverse images from the mined prompts.
Use encoders that accept arbitrary-shaped inputs for artistic text tasks.
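The first two points describe a two-stage pipeline: mine prompts, then synthesize images from them. The sketch below shows only that structure, with the two stages passed in as plain callables. It calls neither Qwen3-VL nor Z-Image, and what "prompt mining" involves is not detailed in this summary, so the sketch assumes it means deriving a style description from reference material. All names (`SyntheticSample`, `build_synthetic_set`, the `fake_*` stubs) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SyntheticSample:
    prompt: str    # text prompt sent to the image generator
    label: str     # word the image is intended to contain
    image: object  # whatever the generator returns


def build_synthetic_set(
    references: Iterable[str],
    words: List[str],
    mine_prompt: Callable[[str], str],
    generate_image: Callable[[str], object],
) -> List[SyntheticSample]:
    """Pair each reference with a target word and run both stages."""
    samples = []
    for reference, word in zip(references, words):
        style = mine_prompt(reference)  # stage 1: prompt mining
        prompt = f'{style}, with the text "{word}"'
        image = generate_image(prompt)  # stage 2: image synthesis
        samples.append(SyntheticSample(prompt, word, image))
    return samples


# Stubs standing in for the real models (assumptions, not real APIs).
def fake_mine_prompt(reference):
    return f"poster lettering in the style of {reference}"


def fake_generate_image(prompt):
    return {"rendered_from": prompt}


samples = build_synthetic_set(
    ["ref_a", "ref_b"], ["SALE", "CAFE"], fake_mine_prompt, fake_generate_image
)
print(samples[0].prompt)
```

Because the target word is chosen before generation, each sample carries an intended label at no annotation cost. A generator can still misrender text, so a real pipeline would likely need a verification or filtering step.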

Topics

WordArt Recognition
Scene Text Recognition
Synthetic Data Generation
Qwen3-VL
Z-Image
Autoregressive Decoders
WordArt-Bench

Code references

YesianRohn/WATER

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.