Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models
Summary
Martingale Doppelgänger-Eval is a new public shadow-market benchmark for auditing how well vision-language models (VLMs) understand candlestick charts, and specifically for distinguishing genuine use of visual evidence from trend extrapolation. The paper frames this as an "identification problem" and proves that observational scores cannot separate grounded responders from trend-shortcut responders, because chart evidence and past trend are strongly coupled. The benchmark evaluates frozen VLMs using four controlled mechanisms: a martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3). Across commercial and open VLMs, the identified regression shows large positive coefficients on past trend (βS from 0.20 to 0.41) but evidence coefficients (βE) that are zero or opposite in sign to what the rules imply. In short, the models are trend-biased, weak on injected evidence, and vulnerable to trend-confounder shortcuts.
Key takeaway
If you are a machine learning engineer developing or deploying VLMs for financial analysis, move beyond observational accuracy metrics. Your models likely rely on trend-following shortcuts and ignore or misread local candlestick evidence. Use an interventional audit framework such as Martingale Doppelgänger-Eval to test for genuine visual understanding, and check that your models' answers and explanations change in line with causal edits to the chart rather than merely extrapolating momentum. This guards against confident but unsupported explanations in critical decision-support systems.
Key insights
Observational VLM evaluation cannot certify genuine candlestick understanding due to trend-evidence coupling; interventional designs are critical.
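The identification problem can be illustrated with a toy example. The sketch below is not the paper's construction: it assumes two hypothetical responders (`grounded`, which reads only the local evidence, and `trend_shortcut`, which reads only the past trend) and uses perfect coupling as a limiting case of the "strong coupling" described above. Trend and evidence are encoded as +1 (up) or -1 (down).

```python
# Toy illustration (assumed setup, not the paper's benchmark).
# Each chart is summarized as (past_trend, local_evidence), both in {-1, +1}.

def grounded(trend, evidence):
    """Hypothetical responder that answers from the local evidence only."""
    return evidence

def trend_shortcut(trend, evidence):
    """Hypothetical responder that simply extrapolates the past trend."""
    return trend

# Observational data: evidence always agrees with the trend (perfect coupling).
observational = [(+1, +1), (-1, -1), (+1, +1), (-1, -1)]

# Interventional data: half of the charts have evidence set against the trend.
interventional = [(+1, -1), (-1, +1), (+1, +1), (-1, -1)]

def agreement(charts):
    """Fraction of charts on which the two responders give the same answer."""
    same = sum(grounded(s, e) == trend_shortcut(s, e) for s, e in charts)
    return same / len(charts)

obs_agreement = agreement(observational)
int_agreement = agreement(interventional)
print(obs_agreement, int_agreement)
```

On the coupled charts the two responders are indistinguishable, so any score computed from their answers is identical. They only come apart on charts where evidence has been set independently of the trend.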
Principles
- Observational chart evaluation cannot distinguish grounded from trend-shortcut responders.
- Matched evidence interventions separate responders at an exponential rate.
- Martingale-null optimality: under the martingale null, predicting an up-probability of p↑=1/2 is Bayes-optimal for paired labels.
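The last principle can be checked numerically under one assumption the summary does not state: that forecasts are scored with a proper scoring rule, here the Brier score. Under a martingale null the next move is up with probability 1/2 regardless of the chart, so the sketch below searches a grid of forecasts for the one with the lowest expected score. The function name and the grid are illustrative.

```python
def expected_brier(p, q=0.5):
    """Expected Brier score of forecasting P(up) = p when the true P(up) is q.

    With probability q the outcome is 1 (loss (1 - p)^2),
    with probability 1 - q the outcome is 0 (loss p^2).
    """
    return q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2

# Candidate forecasts 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Under the martingale null (q = 1/2), find the forecast with minimal expected loss.
best_p = min(grid, key=expected_brier)
print(best_p, expected_brier(best_p))
```

Any forecast that leans up or down, for example because of a visible trend, pays a strictly higher expected score on null-market charts, which is what makes M0 usable as a bias probe.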
Method
The benchmark uses four controlled mechanisms: martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3), analyzed via a structural behavioral model.
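The summary does not give the functional form of the structural behavioral model. As a rough sketch, assume a linear probability model P(up) = 1/2 + bias + βS·S + βE·E, with past trend S and evidence E coded as ±1 and crossed in a balanced design (which the interventions make possible). In that design the least-squares coefficients reduce to simple averages. The simulated responder and its parameter values below are hypothetical and are not results from the paper.

```python
import random

def simulate_responder(n_per_cell=2000, b0=0.5, b_s=0.3, b_e=0.0, seed=0):
    """Simulate a hypothetical trend-following responder on a balanced 2x2 design.

    Returns rows (s, e, y): trend s and evidence e in {-1, +1}, y = 1 if "up".
    """
    rng = random.Random(seed)
    rows = []
    for s in (-1, 1):
        for e in (-1, 1):
            p_up = b0 + b_s * s + b_e * e
            for _ in range(n_per_cell):
                rows.append((s, e, 1 if rng.random() < p_up else 0))
    return rows

def fit_balanced(rows):
    """Least-squares fit for a balanced, crossed +/-1 design.

    Because s and e are orthogonal and mean-zero here, OLS reduces to averages.
    """
    n = len(rows)
    bias = sum(y for _, _, y in rows) / n - 0.5   # deviation of P(up) from 1/2
    beta_s = sum(s * y for s, _, y in rows) / n   # trend sensitivity
    beta_e = sum(e * y for _, e, y in rows) / n   # evidence sensitivity
    return bias, beta_s, beta_e

bias, beta_s, beta_e = fit_balanced(simulate_responder())
print(f"bias={bias:+.3f}  beta_S={beta_s:+.3f}  beta_E={beta_e:+.3f}")
```

The shortcut formulas in `fit_balanced` hold only because the design is balanced and crossed. On observational data, where S and E move together, the two coefficients cannot be separated this way.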
In practice
- Audit time-series imagery using martingale-null labels and counterfactual evidence.
- Decompose VLM responses into null-market bias, trend, and evidence sensitivity.
- Use block-aware sequential testing for metered API evaluations.
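The summary does not describe the paper's sequential procedure, so the following is a generic, conservative sketch of the idea rather than the authors' method. It assumes that charts cut from the same simulated path are dependent and therefore treats each block's mean "up" rate (a value in [0, 1]) as one independent observation. After every block it applies a Hoeffding bound with the error budget split across looks, so API calls can stop as soon as the model's up-rate is clearly away from 1/2.

```python
import math

def sequential_block_test(block_means, null_mean=0.5, alpha=0.05):
    """Conservative sequential test on block-level means.

    block_means: iterable of per-block averages in [0, 1], assumed independent.
    Returns ("reject", k) at the first block k where the running mean leaves
    the Hoeffding band around null_mean, else ("continue", blocks_seen).
    """
    total = 0.0
    k = 0
    for k, m in enumerate(block_means, start=1):
        total += m
        # Spend alpha / (k (k + 1)) at look k; these budgets sum to at most alpha.
        alpha_k = alpha / (k * (k + 1))
        radius = math.sqrt(math.log(2.0 / alpha_k) / (2.0 * k))
        if abs(total / k - null_mean) > radius:
            return "reject", k
    return "continue", k

# Hypothetical inputs: a strongly up-biased responder and an unbiased one.
biased_decision, biased_blocks = sequential_block_test([0.9] * 100)
null_decision, null_blocks = sequential_block_test([0.5] * 100)
print(biased_decision, biased_blocks, null_decision, null_blocks)
```

Tighter confidence sequences exist and would stop sooner. The union-bound version is shown because its validity is easy to verify.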
Topics
- Vision-Language Models
- Candlestick Charts
- Financial Time Series
- Model Auditing
- Shortcut Learning
- Counterfactual Evaluation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.