TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance
Summary
TraRA (Trajectory-level Recognition Aggregation) is a plug-and-play method that improves Video Text Spotting (VTS) in urban surveillance by correcting recognition results that are inconsistent from frame to frame. It combines two modules. Temporal Clustering (TC) refines noisy text trajectories by grouping visually and temporally coherent instances. Vision-Language Aggregation (VLA) uses a Vision-Language Model (VLM) fine-tuned with Low-Rank Adaptation (LoRA) to fuse visual cues with linguistic context across frames. Recognizing text at the trajectory level makes the output more robust to motion blur and occlusion. Experiments on four benchmarks (ArTVideo, RoadText, BOVText, and ICDAR15) show consistent gains in tracking and recognition. Combined with GoMatching++, TraRA raised recognition accuracy (WA) by 1.67 on ArTVideo, 2.54 on RoadText, and 4.65 on BOVText, with MOTA gains of up to 18.0 on BOVText. Combined with TransDETR, WA rose by 50.49 on RoadText and 47.85 on BOVText.
Key takeaway
For Machine Learning Engineers building urban surveillance or intelligent transportation systems, TraRA is a plug-and-play way to improve video text spotting on top of an existing spotter. If your current VTS model produces inconsistent frame-level results under motion blur or occlusion, adding TraRA can raise recognition accuracy and tracking stability, as the benchmark gains reported for GoMatching++ and TransDETR suggest. The core idea to adopt is trajectory-level aggregation, especially the VLA module, which recognizes each text trajectory as a whole rather than frame by frame.
Key insights
TraRA improves video text spotting by aggregating temporal and multimodal cues for robust trajectory-level recognition.
Principles
- Aggregate information across entire text trajectories.
- Refine noisy trajectories using temporal and visual consistency.
- Fuse visual and linguistic context for robust word prediction.
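To make the first principle concrete, here is a minimal Python sketch of aggregating per-frame predictions into one word per trajectory. It uses confidence-weighted voting, which is a deliberately simple stand-in and not TraRA's VLA module. The function name `aggregate_trajectory` and the `(text, confidence)` input format are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_trajectory(frame_predictions):
    """Pick one word for a trajectory from per-frame (text, confidence) pairs.

    Confidence-weighted voting: an illustrative baseline only,
    NOT the Vision-Language Aggregation used by TraRA.
    """
    if not frame_predictions:
        return ""
    scores = defaultdict(float)
    for text, confidence in frame_predictions:
        scores[text] += confidence
    # Highest summed confidence wins; sorting first makes ties
    # resolve alphabetically, so the result is deterministic.
    return max(sorted(scores), key=lambda text: scores[text])


# Hypothetical per-frame outputs for one tracked sign; blur corrupts some frames.
trajectory = [("STOP", 0.9), ("ST0P", 0.4), ("STOP", 0.8), ("SIOP", 0.3)]
print(aggregate_trajectory(trajectory))
```

Voting only helps when most frames are read correctly. The summary describes VLA as going further by fusing visual cues with linguistic context across frames.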
Method
TraRA refines VTS trajectories via Temporal Clustering (TC) to group consistent instances, then uses a LoRA-enhanced VLM for Vision-Language Aggregation (VLA) to predict words from aggregated visual and linguistic cues.
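For readers unfamiliar with LoRA, the standard formulation is shown below. The pretrained weight matrix W_0 stays frozen and only the low-rank factors A and B are trained. The symbols follow the usual LoRA notation and are not taken from this summary, which does not state the rank or the VLM layers that TraRA adapts.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Because r is much smaller than d and k, the number of trainable parameters drops from d·k to r·(d + k) per adapted matrix, which is what makes the adaptation efficient.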
In practice
- Use HOG, SIFT, and bounding-box area as discriminative features for text instances.
- Use online time-based clustering with adaptive thresholds for trajectory refinement.
- Fine-tune VLMs with LoRA for efficient adaptation to video text data.
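The second practice, online time-based clustering with an adaptive threshold, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's TC algorithm. Feature vectors are assumed to be precomputed (for example from HOG, SIFT, and box area), and the names `online_cluster`, `base_threshold`, `scale`, and `max_gap`, as well as the threshold rule itself, are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    centroid: list
    last_frame: int
    members: list = field(default_factory=list)  # frame indices in this cluster
    mean_dist: float = 0.0  # running mean of accepted distances


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def online_cluster(instances, base_threshold=1.0, scale=2.0, max_gap=5):
    """Cluster (frame_index, feature_vector) pairs that arrive in frame order.

    Assumed rule (hypothetical): an instance joins the nearest cluster seen
    within `max_gap` frames if its distance is at most
    base_threshold + scale * (that cluster's running mean distance).
    """
    clusters = []
    for frame, feat in instances:
        best, best_dist = None, float("inf")
        for cluster in clusters:
            if frame - cluster.last_frame > max_gap:
                continue  # too old to be the same text instance
            dist = euclidean(feat, cluster.centroid)
            # Adaptive threshold: widens with the cluster's observed spread.
            threshold = base_threshold + scale * cluster.mean_dist
            if dist <= threshold and dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append(Cluster(list(feat), frame, [frame]))
            continue
        n = len(best.members)
        # Incremental centroid and running-mean updates.
        best.centroid = [(c * n + x) / (n + 1) for c, x in zip(best.centroid, feat)]
        best.mean_dist += (best_dist - best.mean_dist) / n
        best.members.append(frame)
        best.last_frame = frame
    return clusters


# Toy 2-D features: two text instances interleaved over five frames.
instances = [(0, [0.0, 0.0]), (1, [0.2, 0.1]), (2, [5.0, 5.0]),
             (3, [0.3, 0.0]), (4, [5.1, 4.9])]
print([c.members for c in online_cluster(instances)])
```

On this toy input, frames 0, 1, and 3 fall into one cluster and frames 2 and 4 into another. Processing instances in arrival order keeps the step usable in a streaming setting, and the `max_gap` check is what makes the clustering time-based rather than purely visual.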
Topics
- Video Text Spotting
- Trajectory-level Recognition
- Vision-Language Models
- Low-Rank Adaptation
- Urban Surveillance
- Temporal Clustering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.