A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

2026-05-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Advanced, quick

Summary

A large-scale study of more than 2,000 experiments examines accuracy-versus-cost trade-offs in fine-grained image recognition (FGIR) across training and evaluation settings. It covers 9 pretrained backbones and 17 datasets and looks at factors beyond backbone selection. A key finding is that data augmentation is effective for fine-grained training. The study extends Counterfactual Attention Learning (CAL), a method that uses data-aware cropping and masking augmentations, by adding cross-image discriminative region mixing. It also proposes an efficient evaluation-only variant that keeps accuracy competitive while significantly reducing inference cost by eliminating the forward pass on discriminative crops that CAL and similar FGIR methods typically use. The results show that data-aware augmentations during training alone can achieve high accuracy without requiring crops at inference.

Key takeaway

For AI engineers and research scientists optimizing fine-grained image recognition models: prioritize data-aware augmentations during training. This can yield high accuracy while substantially reducing inference cost, because discriminative crops are no longer needed at evaluation. Consider the proposed evaluation-only variant when the goal is competitive accuracy at lower inference cost.

Key insights

Data-aware augmentations during training can significantly reduce FGIR inference costs while maintaining high accuracy.

Principles

Data augmentation is crucial for fine-grained training.
Inference cost can be reduced by dropping the extra forward pass on discriminative crops at evaluation time.

Method

The study extends Counterfactual Attention Learning (CAL) with cross-image discriminative region mixing and proposes an evaluation-only variant that skips discriminative crop forward passes.
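To make the idea of cross-image discriminative region mixing concrete, here is a minimal sketch, not the authors' implementation. It assumes an attention map at the same resolution as the image, a thresholded bounding box as the "discriminative region", and a CutMix-style label weight based on pasted area. The function names, the `threshold_ratio` parameter, and the label-weighting rule are all illustrative assumptions; plain nested lists stand in for image tensors.

```python
from typing import List, Tuple

Grid = List[List[float]]  # toy single-channel image or attention map


def attention_bbox(attn: Grid, threshold_ratio: float = 0.5) -> Tuple[int, int, int, int]:
    """Bounding box (top, left, bottom, right) of high-attention cells.

    A cell counts as high-attention when its value is at least
    threshold_ratio * max(attn). bottom and right are exclusive.
    Assumes non-negative attention values (hypothetical convention).
    """
    peak = max(max(row) for row in attn)
    cutoff = threshold_ratio * peak
    rows = [r for r, row in enumerate(attn) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(attn[0])) if any(row[c] >= cutoff for row in attn)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def mix_discriminative_region(
    target: Grid, source: Grid, source_attn: Grid, threshold_ratio: float = 0.5
) -> Tuple[Grid, float]:
    """Paste the source image's high-attention box into a copy of the target.

    Returns the mixed image and lam, the weight given to the target label;
    the source label would get 1 - lam (area-based, as in CutMix; an
    assumption here, not a rule stated in the summary).
    """
    top, left, bottom, right = attention_bbox(source_attn, threshold_ratio)
    mixed = [row[:] for row in target]  # copy so the target is not modified
    for r in range(top, bottom):
        mixed[r][left:right] = source[r][left:right]
    pasted_area = (bottom - top) * (right - left)
    total_area = len(target) * len(target[0])
    lam = 1.0 - pasted_area / total_area
    return mixed, lam
```

In a real training loop the attention map would come from the model itself and the images would be batched tensors; the sketch only shows the data flow of taking a region that one image's attention marks as discriminative and mixing it into another image.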

In practice

Use data-aware augmentations during FGIR training.
Consider evaluation-only variants for cost-efficient inference.
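The cost difference between crop-based and crop-free evaluation can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `CountingBackbone` is a hypothetical stand-in that only counts forward passes, `center_crop` is a placeholder for the attention-guided crop that CAL-style methods would compute, and averaging the two sets of logits is an assumed fusion rule.

```python
from typing import Callable, List

Image = List[List[float]]
Logits = List[float]


class CountingBackbone:
    """Hypothetical stand-in for a backbone that counts forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def __call__(self, image: Image) -> Logits:
        self.forward_passes += 1
        flat = [v for row in image for v in row]
        # Toy two-class "logits": mean and max pixel value.
        return [sum(flat) / len(flat), max(flat)]


def center_crop(image: Image) -> Image:
    """Placeholder crop: keep the central region (trims a quarter per side)."""
    h, w = len(image), len(image[0])
    top, left = h // 4, w // 4
    return [row[left:w - left] for row in image[top:h - top]]


def predict(
    model: Callable[[Image], Logits],
    image: Image,
    use_crop: bool,
    crop_fn: Callable[[Image], Image] = center_crop,
) -> Logits:
    """Crop-based evaluation runs the backbone twice; crop-free runs it once."""
    full_logits = model(image)
    if not use_crop:
        return full_logits  # evaluation-only variant: single forward pass
    crop_logits = model(crop_fn(image))  # second forward pass on the crop
    return [(a + b) / 2 for a, b in zip(full_logits, crop_logits)]
```

Calling `predict(model, image, use_crop=True)` increments `model.forward_passes` by two, while `use_crop=False` increments it by one, which is the source of the inference saving described above when the backbone dominates the cost.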

Topics

Fine-Grained Image Recognition
Accuracy-Cost Trade-offs
Data Augmentation
Counterfactual Attention Learning
Inference Cost Reduction

Code references

arkel23/FGIR-Backbones

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.