Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

2026-05-06 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

FEPBench is a new benchmark for evaluating Text-to-Image (T2I) models on natural-science illustration generation. It comprises 1,300 high-quality scientific illustrations from three disciplines (physics and materials, geography and ecology, and biology and medicine) in both single-panel and multi-panel layouts. Multimodal large language models (MLLMs) and human experts provide fine-grained "atom set" annotations covering visual, textual, relational, and layout elements. FEPBench scores T2I models on three dimensions: Instruction Faithfulness (IF), Reasoning Enrichment (RE), and Semantic Precision (SP). Evaluations of nine models, including GPT Image 2 and Nano Banana Pro, show that even leading closed-source models hit bottlenecks in text rendering, exhibit limited scientific reasoning, and struggle to balance generation richness with precision. Closed-source models generally outperform open-source ones, but all models leave significant room for improvement, particularly in faithfulness to text and relational atoms.

Key takeaway

AI scientists and ML engineers developing or deploying T2I models for scientific illustration should recognize that current models, even advanced closed-source systems, significantly underperform in text rendering and scientific reasoning. Prioritize models such as Nano Banana Pro that balance faithfulness, enrichment, and precision, especially for complex multi-panel layouts or high atom-set complexity. Consider structured prompts to improve instruction faithfulness, particularly with open-source models, but be aware that this may reduce semantic precision.

Key insights

FEPBench offers a fine-grained, atom-set-based benchmark for T2I models generating scientific illustrations, revealing current limitations in text, reasoning, and precision.

Principles

Scientific illustrations require fine-grained evaluation.
Separate prompt-mandated from reference-only content.
Reward faithfulness, recognize enrichment, penalize overgeneration.

Method

FEPBench builds gold atom sets using OCR and MLLMs, then uses MLLMs to verify each generated illustration against those sets and compute Instruction Faithfulness, Reasoning Enrichment, and Semantic Precision scores.
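
The three scores can be illustrated with a minimal sketch. This summary does not give the benchmark's exact formulas, so the ratios below are assumptions inferred from the principles above: IF as recall over prompt-mandated atoms, RE as recall over reference-only atoms, and SP as the share of generated atoms that are supported. The function names, the set-of-strings atom representation, and the example atoms are all hypothetical, and the verifier's judgments are taken as given inputs rather than computed by an MLLM.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty atom set scores 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def fep_scores(
    prompt_atoms: set,          # gold atoms the prompt explicitly asks for
    reference_only_atoms: set,  # gold atoms present only in the reference figure
    found_atoms: set,           # gold atoms a verifier judged present in the output
    generated_atoms: set,       # every atom extracted from the generated image
    supported_atoms: set,       # generated atoms a verifier judged correct
) -> dict:
    """Assumed (not official) set-ratio versions of the three FEPBench scores."""
    return {
        # Faithfulness: how much of the mandated content was rendered
        "IF": ratio(len(prompt_atoms & found_atoms), len(prompt_atoms)),
        # Enrichment: how much unprompted reference content was inferred
        "RE": ratio(len(reference_only_atoms & found_atoms), len(reference_only_atoms)),
        # Precision: how much of what was drawn is supported (penalizes overgeneration)
        "SP": ratio(len(generated_atoms & supported_atoms), len(generated_atoms)),
    }


# Hypothetical example: a cell-biology figure with one spurious label
scores = fep_scores(
    prompt_atoms={"mitochondrion", "label: ATP", "arrow: glucose to pyruvate"},
    reference_only_atoms={"cristae", "label: matrix"},
    found_atoms={"mitochondrion", "arrow: glucose to pyruvate", "cristae"},
    generated_atoms={"mitochondrion", "arrow: glucose to pyruvate", "cristae", "label: chloroplast"},
    supported_atoms={"mitochondrion", "arrow: glucose to pyruvate", "cristae"},
)
print(scores)
```

In this toy case the missing "label: ATP" lowers IF, the one inferred reference-only atom gives partial RE, and the spurious "label: chloroplast" lowers SP, which mirrors the trade-off between richness and precision described in the summary.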

In practice

Use structured prompts for open-source T2I models.
Prioritize T2I models robust to semantic complexity.
Focus on improving text rendering in scientific figures.
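
One way to act on the structured-prompt advice is sketched below. The summary does not specify what prompt structure was evaluated, so this layout is an assumption: it simply groups required elements by the four atom types named above (visual, textual, relational, layout). The function name `build_structured_prompt`, the section wording, and the example content are hypothetical.

```python
# Atom types named in the summary; the ordering here is an arbitrary choice.
ATOM_KINDS = ("visual", "textual", "relational", "layout")


def build_structured_prompt(description: str, atoms: dict[str, list[str]]) -> str:
    """Render a task description plus per-type element lists as plain text."""
    lines = [f"Task: {description}"]
    for kind in ATOM_KINDS:
        items = atoms.get(kind, [])
        if not items:
            continue  # omit empty sections so the prompt stays compact
        lines.append(f"{kind.capitalize()} elements:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Hypothetical example for a geography figure
prompt = build_structured_prompt(
    "Single-panel diagram of the water cycle",
    {
        "visual": ["cloud", "ocean"],
        "textual": ["label 'Evaporation'"],
        "relational": ["arrow from ocean to cloud"],
    },
)
print(prompt)
```

Listing only what the figure must contain keeps the prompt focused on mandated content, which matters given the caution above that structured prompts may reduce semantic precision.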

Topics

Text-to-Image Models
Scientific Illustration
Benchmark Evaluation
Multimodal LLMs
Instruction Faithfulness
Semantic Precision
Reasoning Enrichment

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.