Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models

Source: stat.ML updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, FinTech & Digital Financial Services) · Depth: Expert, extended

Summary

Martingale Doppelgänger-Eval is a new public shadow-market benchmark for auditing how well vision-language models (VLMs) understand candlestick charts, and specifically whether they use genuine visual evidence or merely extrapolate the past trend. The benchmark addresses an "identification problem": because chart evidence and past trend are strongly coupled, the authors prove that observational scores cannot separate grounded responders from trend-shortcut responders. It evaluates frozen VLMs under four controlled mechanisms: a martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3). Across commercial and open VLMs, the identified regression yields large positive coefficients on past trend (βS from 0.20 to 0.41) but evidence coefficients (βE) that are zero or opposite to the rule-implied signs. In short, the models are trend-biased, respond weakly to injected evidence, and are vulnerable to trend-confounder shortcuts.

Key takeaway

If you develop or deploy VLMs for financial analysis, do not rely on observational accuracy metrics alone. Your models likely take significant trend-following shortcuts, ignoring or misreading local candlestick evidence. Use an interventional audit framework such as Martingale Doppelgänger-Eval to test for genuine visual understanding and to check that your models' explanations change in line with causal edits to the chart rather than simply extrapolating momentum. Doing so helps prevent confident but unsupported explanations in critical decision-support systems.

Key insights

Observational VLM evaluation cannot certify genuine candlestick understanding due to trend-evidence coupling; interventional designs are critical.
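
The coupling argument can be illustrated with a toy simulation. This is a minimal sketch, not the paper's construction: the two responders, the `make_charts` helper, and the `coupling` parameter are hypothetical names, and perfect coupling is used as a limiting case of the "strong coupling" described above. When evidence always agrees with the past trend, a responder that reads the evidence and one that only extrapolates the trend earn identical scores; once evidence is assigned independently of trend, they separate.

```python
import random

def grounded(trend, evidence):
    # Hypothetical responder that reads only the local candlestick evidence.
    return evidence

def trend_shortcut(trend, evidence):
    # Hypothetical responder that only extrapolates the past trend.
    return trend

def accuracy(responder, charts):
    # Fraction of charts where the response equals the rule-implied label,
    # taken here (by assumption) to be the sign of the evidence.
    return sum(responder(s, e) == e for s, e in charts) / len(charts)

def make_charts(n, coupling, rng):
    # Each chart is (trend sign, evidence sign). With probability `coupling`
    # the evidence agrees with the trend; otherwise it is flipped.
    charts = []
    for _ in range(n):
        s = rng.choice([-1, 1])
        e = s if rng.random() < coupling else -s
        charts.append((s, e))
    return charts

rng = random.Random(0)
observational = make_charts(2000, coupling=1.0, rng=rng)   # evidence tied to trend
interventional = make_charts(2000, coupling=0.5, rng=rng)  # evidence independent of trend

for name, charts in [("observational", observational), ("interventional", interventional)]:
    print(name, accuracy(grounded, charts), accuracy(trend_shortcut, charts))
```

On the coupled data both responders score perfectly, so accuracy alone cannot tell them apart. On the decoupled data the grounded responder stays perfect while the shortcut responder falls to roughly chance.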

Method

The benchmark uses four controlled mechanisms: martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3), analyzed via a structural behavioral model.
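
To make the coefficient reading concrete, here is a hedged sketch of a two-regressor linear fit of model responses on past trend and chart evidence. The linear form, the absence of an intercept, the function name `fit_two_regressors`, and all simulated values are assumptions for illustration; the paper's actual structural behavioral model may differ, and none of the numbers below are results from the paper. The simulated responder is trend-biased by construction (trend weight 0.3, evidence weight 0), and the fit recovers that pattern only because trend and evidence are assigned independently, as in an interventional design.

```python
import random

def fit_two_regressors(S, E, y):
    # Ordinary least squares for y ~ beta_S * S + beta_E * E (no intercept),
    # solved through the 2x2 normal equations.
    a = sum(s * s for s in S)
    b = sum(s * e for s, e in zip(S, E))
    d = sum(e * e for e in E)
    p = sum(s * v for s, v in zip(S, y))
    q = sum(e * v for e, v in zip(E, y))
    det = a * d - b * b
    if det == 0:
        # Perfectly coupled trend and evidence: the coefficients are not identified.
        raise ValueError("trend and evidence are collinear")
    beta_s = (p * d - b * q) / det
    beta_e = (a * q - b * p) / det
    return beta_s, beta_e

rng = random.Random(1)
n = 5000
S = [rng.choice([-1, 1]) for _ in range(n)]  # past-trend sign
E = [rng.choice([-1, 1]) for _ in range(n)]  # evidence sign, independent of S
# Simulated trend-biased responder (illustrative weights, not paper data).
y = [0.3 * s + 0.0 * e + rng.gauss(0, 0.5) for s, e in zip(S, E)]

beta_s, beta_e = fit_two_regressors(S, E, y)
print(f"beta_S = {beta_s:.3f}, beta_E = {beta_e:.3f}")
```

Passing identical trend and evidence series raises the `ValueError`, which is the regression-level counterpart of the identification problem: without interventions that break the coupling, the two coefficients cannot be estimated separately.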

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.