Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models
Summary
FEPBench is a new benchmark designed to evaluate Text-to-Image (T2I) models specifically for natural-science illustration generation. It comprises 1,300 high-quality scientific illustrations from three disciplines—physics and materials, geography and ecology, and biology and medicine—featuring both single-panel and multi-panel layouts. The benchmark employs multimodal large language models (MLLMs) and human experts to provide fine-grained "atom set" annotations, covering visual, textual, relational, and layout elements. FEPBench assesses T2I models across three dimensions: Instruction Faithfulness (IF), Reasoning Enrichment (RE), and Semantic Precision (SP). Evaluations of nine models, including GPT Image 2 and Nano Banana Pro, reveal that even leading closed-source models face bottlenecks in text rendering, exhibit limited scientific reasoning, and struggle to balance generation richness with precision. Closed-source models generally outperform open-source counterparts, but all models show significant room for improvement, particularly in text and relational atom faithfulness.
Key takeaway
AI scientists and ML engineers developing or deploying T2I models for scientific illustration should recognize that current models, even advanced closed-source systems, still fall well short in text rendering and scientific reasoning. Prefer models such as Nano Banana Pro that balance faithfulness, enrichment, and precision, especially for complex multi-panel layouts or highly complex atom sets. Structured prompts can improve instruction faithfulness, particularly with open-source models, but they may reduce semantic precision.
Key insights
FEPBench offers a fine-grained, atom-set-based benchmark for T2I models generating scientific illustrations, revealing current limitations in text, reasoning, and precision.
Principles
- Scientific illustrations require fine-grained evaluation.
- Separate prompt-mandated from reference-only content.
- Reward faithfulness, recognize enrichment, penalize overgeneration.
Method
FEPBench constructs gold atom sets using OCR and an MLLM, then uses MLLMs to verify generated illustrations against these sets and computes Instruction Faithfulness, Reasoning Enrichment, and Semantic Precision scores.
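To make the three scores concrete, here is a minimal sketch of how atom-level verification results could be aggregated. This summary does not give the paper's exact formulas, so the definitions below are assumptions chosen to match the stated principles: IF is taken as the share of prompt-mandated atoms found in the image, RE as the share of reference-only atoms found, and SP as the share of generated elements that match some gold atom. The `Atom` class, the `fep_scores` function, and the sample atoms are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    kind: str       # "visual", "textual", "relational", or "layout"
    text: str       # short description of the atom
    mandated: bool  # True if required by the prompt, False if reference-only


def fep_scores(gold, present, unsupported):
    """Aggregate verification results into IF, RE, and SP (assumed definitions).

    gold:        list of Atom from the gold atom set
    present:     set of atom texts a verifier judged present in the image
    unsupported: count of generated elements matching no gold atom
    """
    mandated = [a for a in gold if a.mandated]
    reference_only = [a for a in gold if not a.mandated]
    hit_mandated = sum(a.text in present for a in mandated)
    hit_reference = sum(a.text in present for a in reference_only)
    matched = hit_mandated + hit_reference

    def ratio(num, den):
        # Guard against empty groups instead of dividing by zero.
        return num / den if den else 0.0

    return {
        "IF": ratio(hit_mandated, len(mandated)),
        "RE": ratio(hit_reference, len(reference_only)),
        "SP": ratio(matched, matched + unsupported),
    }


# Hypothetical example: three mandated atoms, two reference-only atoms.
gold = [
    Atom("visual", "beaker", True),
    Atom("textual", "label: NaCl", True),
    Atom("relational", "arrow from beaker to flask", True),
    Atom("visual", "thermometer", False),
    Atom("layout", "two panels side by side", False),
]
present = {"beaker", "arrow from beaker to flask", "thermometer"}
scores = fep_scores(gold, present, unsupported=1)
print(scores)
```

In this toy case the missed text label lowers IF, the thermometer counts toward RE because it was not requested but is in the reference, and the one unsupported element lowers SP. In the benchmark itself the presence judgments come from MLLM verification rather than a hand-written set.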
In practice
- Use structured prompts for open-source T2I models, accepting a possible drop in semantic precision.
- Prioritize T2I models robust to semantic complexity.
- Focus on improving text rendering in scientific figures.
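As an illustration of the structured-prompt advice, the sketch below turns a free-text description and a set of required elements into a sectioned prompt. The layout (a `Task:` line followed by one section per atom type) is an assumption made for illustration, not the prompt format used in the paper, and `build_structured_prompt` is a hypothetical helper.

```python
def build_structured_prompt(description, atoms):
    """Build a sectioned prompt from a description and required elements.

    description: free-text summary of the figure
    atoms:       dict mapping an atom type ("visual", "textual",
                 "relational", "layout") to a list of required items
    """
    # Section order and titles are illustrative choices, not from the paper.
    sections = [
        ("visual", "Visual elements"),
        ("textual", "Text labels"),
        ("relational", "Relations"),
        ("layout", "Layout"),
    ]
    lines = [f"Task: {description}"]
    for key, title in sections:
        items = atoms.get(key, [])
        if not items:
            continue  # omit empty sections rather than emit bare headers
        lines.append(f"{title}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


prompt = build_structured_prompt(
    "Single-panel diagram of the water cycle",
    {
        "visual": ["ocean", "cloud", "mountain"],
        "textual": ["Evaporation", "Precipitation"],
        "relational": ["arrow from ocean to cloud"],
    },
)
print(prompt)
```

Listing text labels explicitly targets the text-rendering weakness noted above. Because structured prompts may reduce semantic precision, it is worth checking outputs for elements that were not requested.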
Topics
- Text-to-Image Models
- Scientific Illustration
- Benchmark Evaluation
- Multimodal LLMs
- Instruction Faithfulness
- Semantic Precision
- Reasoning Enrichment
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.