SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

SAFE-Cascade is an interactive system for cost-adaptive chart question answering that addresses the expense of invoking a vision-language model (VLM) for every query. The system first extracts chart text using Azure Document Intelligence for OCR, then obtains a provisional answer from a text-only language model, gpt-5-mini. A Random Forest router, trained on inference-time features, then decides whether to accept this text answer or escalate the query to a VLM, gemini-2.5-flash-image. On a 375-example ChartQA test split, SAFE-Cascade achieved 69.1% unified accuracy with 73.1% VLM invocation. This is on par with, and nominally 1.4 points above, a full-VLM baseline (67.7% accuracy, 100% VLM invocation), while reducing VLM calls by 26.9% and estimated cost by 9.3%. The user interface exposes the OCR evidence and the routing probability, and lets users adjust the escalation threshold.
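The gap between the 26.9% drop in VLM calls and the 9.3% drop in estimated cost is worth a quick back-of-envelope check. The sketch below reproduces the call reduction from the reported invocation rate and then, under a simple linear cost model that is an assumption here and not something the source states, backs out how expensive the always-on OCR and text-LM stage would have to be relative to one VLM call. The variable names are illustrative.

```python
# Figures reported in the summary.
vlm_rate = 0.731      # fraction of queries escalated to the VLM
cost_saving = 0.093   # estimated cost reduction vs. the full-VLM baseline

# Reduction in VLM calls relative to calling the VLM on every query.
call_reduction = 1.0 - vlm_rate

# ASSUMPTION (not from the source): costs are linear and measured in units of
# one VLM call. Every query pays a fixed overhead r for OCR + text LM, and
# escalated queries additionally pay 1 for the VLM:
#     relative_cost = r + vlm_rate = 1 - cost_saving
implied_overhead = (1.0 - cost_saving) - vlm_rate

print(f"VLM call reduction: {call_reduction:.1%}")
print(f"Implied always-on overhead: {implied_overhead:.3f} of one VLM call")
```

If that assumed model is roughly right, the cost saving is smaller than the call reduction because every query still pays for OCR and the text-only model before the router decides.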

Key takeaway

AI architects designing chart question answering systems should consider a cost-adaptive routing mechanism. As SAFE-Cascade demonstrates, engaging the expensive model selectively can hold accuracy at the level of full VLM invocation while cutting VLM calls by 26.9% and estimated cost by 9.3%. Evaluate your current VLM usage and explore multi-stage pipelines to allocate resources better and make routing decisions transparent to users.

Key insights

Selective modality routing can match full-VLM accuracy while cutting VLM calls by 26.9% and estimated cost by 9.3%, and exposing the routing decision makes the system more transparent.

Principles

Prioritize text-only reasoning for cost savings.
Use a learned router for VLM escalation.
Expose decision logic for user transparency.

Method

The system extracts chart text via OCR, obtains a provisional answer from a text-only LM, then uses a Random Forest router to decide VLM escalation.
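The control flow described above can be sketched as a single function with the four stages injected as callables. This is a minimal sketch, not the authors' implementation: the function and field names, the router signature, the convention that the router returns the probability that escalation is needed, and the default threshold of 0.5 are all assumptions made for illustration. The lambdas at the bottom are toy stand-ins for the OCR service, the text-only LM, the Random Forest router, and the VLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    used_vlm: bool
    escalation_prob: float  # router output, kept so a UI can display it
    ocr_text: str           # OCR evidence, kept so a UI can display it


def cascade_answer(
    image: bytes,
    question: str,
    ocr: Callable[[bytes], str],
    text_lm: Callable[[str, str], str],
    router_prob: Callable[[str, str, str], float],
    vlm: Callable[[bytes, str], str],
    threshold: float = 0.5,  # hypothetical default; the source lets users tune it
) -> CascadeResult:
    # Stage 1: extract chart text.
    ocr_text = ocr(image)
    # Stage 2: provisional answer from the text-only model.
    text_answer = text_lm(ocr_text, question)
    # Stage 3: router estimates how likely escalation is needed (assumed convention).
    p = router_prob(ocr_text, question, text_answer)
    # Stage 4: escalate to the VLM only when the router is not confident in the text answer.
    if p >= threshold:
        return CascadeResult(vlm(image, question), True, p, ocr_text)
    return CascadeResult(text_answer, False, p, ocr_text)


# Toy stand-ins so the sketch runs end to end.
result = cascade_answer(
    image=b"",
    question="What is the 2020 value?",
    ocr=lambda img: "2019: 10, 2020: 12",
    text_lm=lambda text, q: "12",
    router_prob=lambda text, q, a: 0.2,
    vlm=lambda img, q: "12",
)
print(result.used_vlm, result.answer)
```

Returning the OCR text and the router probability alongside the answer is one way to support the kind of transparent interface the summary describes.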

In practice

Implement a multi-stage QA pipeline.
Train routers on inference-time features.
Allow users to tune cost-accuracy thresholds.
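To make the threshold trade-off concrete, the sketch below sweeps an escalation threshold over a handful of validation records and reports accuracy and VLM invocation rate at each setting. It assumes that, for each held-out query, you have logged the router's escalation probability and whether the text-only and VLM answers were correct. The record layout, the function name, and the five example records are made up for illustration and are not data from the paper.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    escalation_prob: float  # router output for this query
    text_correct: bool      # was the text-only answer right?
    vlm_correct: bool       # was the VLM answer right?


def sweep_thresholds(
    records: List[Record], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, accuracy, vlm_rate) for each candidate threshold."""
    rows = []
    for t in thresholds:
        # A query is escalated when the router probability reaches the threshold.
        escalated = [r.escalation_prob >= t for r in records]
        # The final answer comes from the VLM if escalated, else from the text model.
        correct = sum(
            r.vlm_correct if esc else r.text_correct
            for r, esc in zip(records, escalated)
        )
        rows.append((t, correct / len(records), sum(escalated) / len(records)))
    return rows


# Made-up validation records, for illustration only.
records = [
    Record(0.9, False, True),
    Record(0.7, False, True),
    Record(0.6, True, True),
    Record(0.3, True, False),
    Record(0.1, True, True),
]

for t, acc, rate in sweep_thresholds(records, [0.0, 0.5, 1.01]):
    print(f"threshold={t:.2f} accuracy={acc:.2f} vlm_rate={rate:.2f}")
```

A threshold of 0.0 escalates everything (the full-VLM baseline) and a threshold above 1.0 escalates nothing (text only). Settings in between trade VLM calls against accuracy, which is the dial the interface hands to the user.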

Topics

Chart Question Answering
Vision-Language Models
Cost Optimization
Modality Routing
Multimodal Systems
Random Forest Classifier

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.