TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

TraRA (Trajectory-level Recognition Aggregation) is a plug-and-play method that improves Video Text Spotting (VTS) in urban surveillance by correcting inconsistent frame-level recognition. It combines two modules. Temporal Clustering (TC) refines noisy text trajectories by grouping visually and temporally coherent instances. Vision-Language Aggregation (VLA) uses a Vision-Language Model (VLM) fine-tuned with Low-Rank Adaptation (LoRA) to fuse visual cues with linguistic context across frames. Aggregating over the whole trajectory keeps recognition robust under motion blur and occlusion. Experiments on four benchmarks (ArTVideo, RoadText, BOVText, and ICDAR15) show consistent gains in tracking and recognition. Combined with GoMatching++, TraRA raised recognition accuracy (WA) by +1.67 on ArTVideo, +2.54 on RoadText, and +4.65 on BOVText, with MOTA gains of up to +18.0 on BOVText. Combined with TransDETR, WA rose by +50.49 on RoadText and +47.85 on BOVText.

Key takeaway

For machine learning engineers building urban surveillance or intelligent transportation systems, TraRA is a plug-and-play way to improve video text spotting. If your current VTS model produces inconsistent readings under motion blur or occlusion, adding TraRA on top of it can raise recognition accuracy and tracking stability. Consider adopting trajectory-level aggregation, especially the VLA module, to reduce frame-level inconsistencies in deployment.

Key insights

TraRA improves video text spotting by aggregating temporal and multimodal cues for robust trajectory-level recognition.

Principles

Aggregate information across entire text trajectories.
Refine noisy trajectories using temporal and visual consistency.
Fuse visual and linguistic context for robust word prediction.
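
To make the first principle concrete, here is a minimal sketch of trajectory-level aggregation. It is not the VLA module: it swaps in simple confidence-weighted voting as a stand-in, only to show why pooling evidence over a whole trajectory can beat trusting any single frame. The function name, the input format, and the sample readings are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_trajectory(frame_predictions):
    """Pick one word for a whole trajectory by confidence-weighted voting.

    frame_predictions: list of (word, confidence) pairs, one per frame.
    Returns the word with the highest summed confidence, or "" if empty.
    NOTE: a voting baseline for illustration, not TraRA's VLM-based aggregation.
    """
    scores = defaultdict(float)
    for word, confidence in frame_predictions:
        scores[word] += confidence
    if not scores:
        return ""
    # Iterating in sorted order breaks ties alphabetically, so the result is deterministic.
    return max(sorted(scores), key=lambda w: scores[w])


# Hypothetical per-frame readings of one tracked sign; blurred frames misread it.
frames = [("PARKING", 0.91), ("PARKINC", 0.40), ("PARKING", 0.85), ("PARK1NG", 0.35)]
print(aggregate_trajectory(frames))
```

Voting can only return a word that some frame already produced. The summary describes VLA as fusing visual cues with linguistic context instead, which this baseline does not attempt.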

Method

TraRA refines VTS trajectories via Temporal Clustering (TC) to group consistent instances, then uses a LoRA-enhanced VLM for Vision-Language Aggregation (VLA) to predict words from aggregated visual and linguistic cues.
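
The clustering step can be sketched as follows. This is an illustrative reading of "online, time-based clustering with an adaptive threshold", not the authors' implementation: the threshold rule, the parameter names (base_threshold, scale, max_gap), and the two-dimensional feature vectors are assumptions. In a real pipeline the features would come from descriptors such as those listed under "In practice".

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def temporal_cluster(instances, base_threshold=0.5, scale=2.0, max_gap=5):
    """Group (frame, feature) instances online, in frame order.

    An instance joins the nearest cluster if that cluster was last seen within
    max_gap frames and the distance to its centroid is within an adaptive
    threshold: max(base_threshold, scale * mean distance of earlier joins).
    Otherwise it starts a new cluster. All parameters are hypothetical.
    """
    clusters = []
    for frame, feature in sorted(instances, key=lambda item: item[0]):
        best, best_dist = None, float("inf")
        for cluster in clusters:
            if frame - cluster["last_frame"] > max_gap:
                continue  # too far apart in time to belong together
            dist = euclidean(feature, cluster["centroid"])
            joins = max(1, len(cluster["frames"]) - 1)
            threshold = max(base_threshold, scale * cluster["dist_sum"] / joins)
            if dist <= threshold and dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append({"centroid": list(feature), "frames": [frame],
                             "last_frame": frame, "dist_sum": 0.0})
            continue
        n = len(best["frames"])
        # Running-mean update of the centroid with the new member.
        best["centroid"] = [(c * n + f) / (n + 1)
                            for c, f in zip(best["centroid"], feature)]
        best["frames"].append(frame)
        best["last_frame"] = frame
        best["dist_sum"] += best_dist
    return clusters


# Hypothetical trajectory: frame 2 is an outlier (e.g. an identity switch).
detections = [(0, [0.0, 0.0]), (1, [0.1, 0.0]), (2, [3.0, 3.0]),
              (3, [0.2, 0.1]), (4, [0.1, 0.1])]
clusters = temporal_cluster(detections)
main = max(clusters, key=lambda c: len(c["frames"]))
print(main["frames"])
```

In this toy input the outlier at frame 2 lands in its own cluster, and the largest cluster keeps the four coherent frames, which would then be passed on for aggregation.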

In practice

Use HOG and SIFT descriptors plus bounding-box area as discriminative features for text instances.
Use online time-based clustering with adaptive thresholds for trajectory refinement.
Fine-tune VLMs with LoRA for efficient adaptation to video text data.
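
The LoRA item can be illustrated with a small, dependency-free sketch of the low-rank update itself: a frozen weight W plus a trainable correction (alpha / r) * B A. This is the generic LoRA formulation, not TraRA's training code. The class name, the matrix sizes, and the hyperparameter values are assumptions, and a real VLM would use a deep learning framework rather than Python lists.

```python
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # Usual LoRA init: A small random, B zero, so the update starts at zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (r x d_in) and B (d_out x r) would receive gradients.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 4x4 layer adapted with rank 1.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0, 1.0, 1.0]
print(layer.forward(x), layer.trainable_params())
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only 8 of the 16 weights' worth of parameters are trainable here. At VLM scale the ratio is far smaller, which is what makes this kind of adaptation efficient.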

Topics

Video Text Spotting
Trajectory-level Recognition
Vision-Language Models
Low-Rank Adaptation
Urban Surveillance
Temporal Clustering

Code references

trid2912/TraRA

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.