SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering
Summary
SAFE-Cascade is an interactive system for cost-adaptive chart question answering that addresses the expense of invoking a vision-language model (VLM) for every query. The system first extracts chart text with Azure Document Intelligence OCR, then obtains a provisional answer from a text-only language model, gpt-5-mini. A Random Forest router, trained on inference-time features, then decides whether to accept this text answer or escalate the query to a VLM, gemini-2.5-flash-image. On a 375-example ChartQA test split, SAFE-Cascade achieved 69.1% unified accuracy with 73.1% VLM invocation. This is on par with the full-VLM baseline (67.7% accuracy, 100% VLM invocation), nominally 1.4 points higher, while reducing VLM calls by 26.9% and estimated cost by 9.3%. The system also offers a transparent user interface that displays the OCR evidence and the routing probability and lets users adjust the escalation threshold.
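The gap between the 26.9% drop in VLM calls and the 9.3% drop in estimated cost is easier to read with a small calculation. The sketch below assumes a simple per-query cost model that the summary does not state: every query pays a fixed overhead for OCR plus the text-only model, and only escalated queries pay for a VLM call. The function names and the `overhead_ratio` parameter are hypothetical, introduced only for illustration.

```python
def relative_cost(vlm_rate: float, overhead_ratio: float) -> float:
    """Cascade cost per query relative to an always-VLM baseline (baseline = 1.0).

    ASSUMED model, not taken from the source:
      overhead_ratio = (OCR cost + text-LM cost) / (cost of one VLM call)
    """
    return overhead_ratio + vlm_rate


def implied_overhead_ratio(vlm_rate: float, cost_reduction: float) -> float:
    """Overhead ratio that would reproduce a given cost reduction under the assumed model."""
    return (1.0 - cost_reduction) - vlm_rate


vlm_rate = 0.731        # reported VLM invocation rate
cost_reduction = 0.093  # reported estimated cost reduction

call_reduction = 1.0 - vlm_rate                              # fraction of VLM calls avoided
overhead = implied_overhead_ratio(vlm_rate, cost_reduction)  # overhead implied by the assumed model

print(f"VLM calls avoided: {call_reduction:.1%}")
print(f"Implied always-on overhead (in VLM-call units): {overhead:.3f}")
print(f"Relative cost vs. full VLM: {relative_cost(vlm_rate, overhead):.3f}")
```

Under this assumed model, the always-on OCR and text-only stages would account for the difference between the two percentages. The actual cost accounting in the original work may differ.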
Key takeaway
AI architects designing chart question answering systems should consider a cost-adaptive routing mechanism. As SAFE-Cascade demonstrates, engaging the expensive model selectively can keep accuracy comparable to invoking the VLM on every query while lowering cost (26.9% fewer VLM calls and an estimated 9.3% cost reduction in the reported evaluation). Evaluate your current VLM usage and explore multi-stage pipelines to optimize resource allocation and improve system transparency.
Key insights
Selective modality routing can match full-VLM accuracy while reducing VLM calls and estimated cost, and exposing the routing decision makes the system more transparent.
Principles
- Prioritize text-only reasoning for cost savings.
- Use a learned router for VLM escalation.
- Expose decision logic for user transparency.
Method
The system extracts chart text via OCR, obtains a provisional answer from a text-only LM, then uses a Random Forest router to decide VLM escalation.
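The control flow described above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the OCR, text-model, router, and VLM steps are passed in as plain callables (all hypothetical stand-ins), and the router is assumed to return a probability that the query should be escalated, which is compared against an adjustable threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    used_vlm: bool
    escalation_prob: float
    ocr_text: str  # kept so a UI could show the OCR evidence


def cascade_answer(
    question: str,
    image: bytes,
    ocr: Callable[[bytes], str],
    text_lm: Callable[[str, str], str],
    escalation_prob: Callable[[str, str, str], float],
    vlm: Callable[[str, bytes], str],
    threshold: float = 0.5,
) -> CascadeResult:
    """Text-first cascade: escalate to the VLM only when the router asks for it."""
    ocr_text = ocr(image)                      # stage 1: extract chart text
    text_answer = text_lm(question, ocr_text)  # stage 2: provisional text-only answer
    prob = escalation_prob(question, ocr_text, text_answer)  # stage 3: router score
    if prob >= threshold:                      # stage 4: escalate or accept
        return CascadeResult(vlm(question, image), True, prob, ocr_text)
    return CascadeResult(text_answer, False, prob, ocr_text)


# Hypothetical stubs standing in for the real OCR service, models, and trained router.
def fake_ocr(image: bytes) -> str:
    return "Sales 2020: 40 | Sales 2021: 55"

def fake_text_lm(question: str, ocr_text: str) -> str:
    return "55"

def fake_router(question: str, ocr_text: str, answer: str) -> float:
    # Toy heuristic: low escalation probability if the answer appears in the OCR text.
    return 0.2 if answer in ocr_text else 0.9

def fake_vlm(question: str, image: bytes) -> str:
    return "55 (from image)"


accepted = cascade_answer("What were sales in 2021?", b"", fake_ocr,
                          fake_text_lm, fake_router, fake_vlm, threshold=0.5)
escalated = cascade_answer("What were sales in 2021?", b"", fake_ocr,
                           fake_text_lm, fake_router, fake_vlm, threshold=0.1)
print(accepted.used_vlm, accepted.answer)
print(escalated.used_vlm, escalated.answer)
```

Returning the OCR text and the router score alongside the answer is one way to support the kind of transparency the summary describes, since both can be shown to the user next to the final answer.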
In practice
- Implement a multi-stage QA pipeline.
- Train routers on inference-time features.
- Allow users to tune cost-accuracy thresholds.
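As a rough illustration of threshold tuning, the sketch below sweeps an escalation threshold over a handful of invented toy records and reports accuracy and VLM invocation rate at each setting. The records are made up for the example and do not come from ChartQA or the original work; the function name and record layout are assumptions.

```python
from typing import List, Sequence, Tuple

# Each record: (router escalation probability, text answer correct?, VLM answer correct?)
Record = Tuple[float, bool, bool]


def sweep_thresholds(
    records: Sequence[Record], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, accuracy, vlm_rate) for each candidate threshold."""
    rows = []
    n = len(records)
    for t in thresholds:
        n_correct = 0
        n_escalated = 0
        for prob, text_ok, vlm_ok in records:
            if prob >= t:            # escalate: the VLM answer is used
                n_escalated += 1
                n_correct += int(vlm_ok)
            else:                    # accept: the text-only answer is used
                n_correct += int(text_ok)
        rows.append((t, n_correct / n, n_escalated / n))
    return rows


# Invented toy data, for illustration only.
toy_records: List[Record] = [
    (0.9, False, True),
    (0.8, False, False),
    (0.6, True, True),
    (0.3, True, True),
    (0.1, True, False),
]

rows = sweep_thresholds(toy_records, thresholds=[0.0, 0.5, 1.01])
for threshold, accuracy, vlm_rate in rows:
    print(f"threshold={threshold:.2f}  accuracy={accuracy:.2f}  vlm_rate={vlm_rate:.2f}")
```

A threshold of 0.0 escalates every query (the full-VLM setting), while a threshold above 1.0 never escalates (the text-only setting). Intermediate values trade VLM calls against accuracy, which is the knob a user-facing slider would expose.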
Topics
- Chart Question Answering
- Vision-Language Models
- Cost Optimization
- Modality Routing
- Multimodal Systems
- Random Forest Classifier
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.