Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models

2026-06-17 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, FinTech & Digital Financial Services · Depth: Expert, extended

Summary

Martingale Doppelgänger-Eval is a new public shadow-market benchmark for auditing how vision-language models (VLMs) read candlestick charts, specifically whether they use genuine visual evidence or merely extrapolate the trend. The work addresses the "identification problem" by proving that observational scores cannot separate grounded responders from trend-shortcut responders, because chart evidence and past trends are strongly coupled. It evaluates frozen VLMs using four controlled mechanisms: a martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3). Across commercial and open VLMs, the identified regression shows large positive coefficients on past trend (βS from 0.20 to 0.41), while the evidence coefficients (βE) are zero or opposite in sign to what the candlestick rules imply. The models are trend-biased, weak on injected evidence, and vulnerable to trend-confounder shortcuts.

Key takeaway

If you are a machine learning engineer developing or deploying VLMs for financial analysis, observational accuracy metrics are not enough. Your models likely take significant trend-following shortcuts, ignoring or misreading local candlestick evidence. Use an interventional audit framework such as Martingale Doppelgänger-Eval to test genuine visual understanding and to check that your models' explanations change with causal edits to the chart rather than simply extrapolating momentum. This guards against confident but unsupported explanations in critical decision-support systems.

Key insights

Observational VLM evaluation cannot certify genuine candlestick understanding due to trend-evidence coupling; interventional designs are critical.
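
A toy illustration of this insight, not the paper's construction: the two responders below, the `coupling` parameter, and the sign coding are all hypothetical. When evidence and trend are perfectly coupled in the observed data, both responders agree with the evidence equally often, and only an edit to the evidence (with the trend held fixed) tells them apart.

```python
import random

def grounded(trend, evidence):
    # Toy responder that reads the local evidence sign only.
    return evidence

def shortcut(trend, evidence):
    # Toy responder that extrapolates the past trend only.
    return trend

def audit(responder, n=1000, coupling=1.0, seed=0):
    """Return (observational agreement with evidence, flip rate under evidence edits)."""
    rng = random.Random(seed)
    agree = flips = 0
    for _ in range(n):
        trend = rng.choice([-1, 1])
        # Observational data: evidence agrees with trend with probability `coupling`.
        evidence = trend if rng.random() < coupling else -trend
        answer = responder(trend, evidence)
        agree += answer == evidence
        # Matched intervention: same trend, evidence sign edited.
        flips += responder(trend, -evidence) != answer
    return agree / n, flips / n

print(audit(grounded))  # (1.0, 1.0): same observational score, always flips
print(audit(shortcut))  # (1.0, 0.0): same observational score, never flips
```

Lowering `coupling` below 1.0 lets the observational scores drift apart, but the flip rate under intervention remains the cleaner diagnostic in this toy setup.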

Principles

Observational chart evaluation cannot distinguish grounded from trend-shortcut responders.
Matched evidence interventions separate responders at an exponential rate.
Under the martingale null, predicting p↑ = 1/2 is Bayes-optimal for paired labels.
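
The last principle can be checked with a short worked derivation. Two assumptions here are not stated in this summary and are made only for illustration: the prediction is scored with squared error (the Brier score), and under the null the next move is up or down with equal probability given the chart. Writing $Y \in \{0, 1\}$ for the "up" label and $p$ for the predicted probability of "up":

```latex
\begin{aligned}
\mathbb{E}\big[(p - Y)^2 \mid \text{chart}\big]
  &= p^2 - 2p\,\mathbb{E}[Y \mid \text{chart}] + \mathbb{E}[Y^2 \mid \text{chart}] \\
  &= p^2 - 2p \cdot \tfrac{1}{2} + \tfrac{1}{2}
     \qquad (Y^2 = Y,\ \mathbb{E}[Y \mid \text{chart}] = \tfrac{1}{2}) \\
  &= \left(p - \tfrac{1}{2}\right)^2 + \tfrac{1}{4}.
\end{aligned}
```

The expected loss is minimized at $p = 1/2$, so under these assumptions any systematic deviation from 1/2 on null-market charts reflects bias in the responder rather than information in the chart.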

Method

The benchmark uses four controlled mechanisms: martingale-null market (M0), injected-alpha counterfactual pairs (M1), trend-confounder swaps (M2), and regime shifts (M3), analyzed via a structural behavioral model.
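
To make the first two mechanisms concrete, here is a minimal sketch under simplifying assumptions. It is not the benchmark's generator: the function names, the symmetric ±step walk, the wickless candles, and the single-candle edit are all hypothetical choices for illustration. It builds a martingale-null close series (in the spirit of M0) and a matched pair of charts that share one history and differ only in the direction of the final candle (in the spirit of M1).

```python
import random

def null_closes(n, step=1.0, start=100.0, seed=0):
    """Close prices from a symmetric +/- step random walk (a martingale)."""
    rng = random.Random(seed)
    closes = [start]
    for _ in range(n):
        closes.append(closes[-1] + rng.choice([-step, step]))
    return closes

def to_candles(closes):
    """Turn consecutive closes into (open, high, low, close) tuples without wicks."""
    return [(o, max(o, c), min(o, c), c) for o, c in zip(closes, closes[1:])]

def counterfactual_pair(closes, size=3.0):
    """Two charts sharing one history, differing only in the final candle's direction."""
    last = closes[-1]
    up = to_candles(closes + [last + size])
    down = to_candles(closes + [last - size])
    return up, down

closes = null_closes(50)
up, down = counterfactual_pair(closes)
print(up[-1], down[-1])  # final candles of the matched pair
```

Because the two charts are identical up to the last candle, any difference in a model's answers across the pair can be attributed to the edited evidence rather than to the preceding trend.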

In practice

Audit time-series imagery using martingale-null labels and counterfactual evidence.
Decompose VLM responses into null-market bias, trend, and evidence sensitivity.
Use block-aware sequential testing for metered API evaluations.
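
The decomposition in the second item above can be sketched as a small regression. This is an illustrative setup rather than the paper's estimator: it assumes a balanced design in which trend and evidence signs are coded as ±1 and each of the four combinations appears equally often, so the least-squares fit of response = b0 + bS·trend + bE·evidence reduces to simple averages. The function name and the example responses are hypothetical.

```python
from statistics import mean

def decompose(records):
    """Split responses into null-market bias, trend sensitivity, evidence sensitivity.

    records: list of (trend_sign, evidence_sign, response), signs in {-1, +1}.
    Assumes a balanced design, where the regressors are orthogonal and the
    least-squares coefficients equal the averages computed below.
    """
    b0 = mean(y for _, _, y in records)      # bias: average response
    bS = mean(s * y for s, _, y in records)  # trend coefficient
    bE = mean(e * y for _, e, y in records)  # evidence coefficient
    return b0, bS, bE

# Hypothetical trend-following responder: p_up = 0.5 + 0.3 * trend, ignoring evidence.
records = [(s, e, 0.5 + 0.3 * s) for s in (-1, 1) for e in (-1, 1)]
b0, bS, bE = decompose(records)
print(round(b0, 3), round(bS, 3), round(bE, 3))
```

For this made-up responder the fit attributes all of the variation to trend and none to evidence, which is the pattern the audit is designed to expose. With an unbalanced design the averages no longer coincide with least squares, and a full regression would be needed instead.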

Topics

Vision-Language Models
Candlestick Charts
Financial Time Series
Model Auditing
Shortcut Learning
Counterfactual Evaluation

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.