Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

2026-06-11 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Custom ZeroCLIP is a retrieval-augmented vision-language framework for zero-shot captioning of traditional Indonesian garments. It uses a dataset of 3,800 expert-annotated images spanning all 38 Indonesian provinces. Evaluation follows a province-level inductive zero-shot protocol: training on 24 seen provinces, validating on 6 seen provinces, and evaluating on 8 unseen provinces. The architecture integrates a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM caption decoder. During inference, the model retrieves captions solely from training provinces, so no unseen-province data is used. Custom ZeroCLIP achieved a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, surpassing existing baselines. Ablation studies confirmed that retrieval improves cultural vocabulary recovery, yielding a 19.3% METEOR gain, and human evaluation validated the captions' cultural accuracy and fluency. The dataset is publicly available.

Key takeaway

For computer vision engineers building AI for cultural heritage, Custom ZeroCLIP shows that zero-shot captioning can work on low-resource regional data. Consider adding retrieval augmentation to your vision-language pipeline to improve cultural vocabulary recovery and accuracy. Evaluating under an inductive zero-shot protocol, with entire provinces held out, shows whether your models generalize to unseen regional variations, which is what makes automated image analysis useful across diverse cultural artifacts.

Key insights

Retrieval-augmented vision-language models effectively generate culturally accurate captions for low-resource heritage data.

Principles

Inductive zero-shot protocols test generalization to unseen domains by holding them out of training entirely.
Retrieval augmentation significantly boosts cultural vocabulary recovery.
Combining multiple encoders enhances domain adaptation.

Method

Custom ZeroCLIP combines a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM caption decoder, using retrieval from seen-province captions during inference for unseen provinces.
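
The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes image and caption embeddings already live in a shared space (as CLIP embeddings would), ranks by cosine similarity, and uses toy 3-dimensional vectors. The function names, the province labels (`seen_a`, `unseen_x`), and the caption strings are all hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(image_emb, caption_bank, seen_provinces, k=2):
    """Return the k captions most similar to image_emb, drawn only from seen provinces."""
    # Restrict the pool first so captions from unseen provinces can never be retrieved.
    pool = [c for c in caption_bank if c["province"] in seen_provinces]
    ranked = sorted(pool, key=lambda c: cosine(image_emb, c["emb"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

# Hypothetical toy data: real embeddings would come from the frozen encoders.
SEEN = {"seen_a", "seen_b", "seen_c"}
BANK = [
    {"province": "seen_a", "text": "caption from seen_a", "emb": [1.0, 0.0, 0.0]},
    {"province": "seen_b", "text": "caption from seen_b", "emb": [0.7, 0.7, 0.0]},
    {"province": "seen_c", "text": "caption from seen_c", "emb": [0.0, 0.0, 1.0]},
    # Identical to the query below, but excluded because its province is unseen.
    {"province": "unseen_x", "text": "caption from unseen_x", "emb": [0.9, 0.1, 0.0]},
]

query = [0.9, 0.1, 0.0]  # stand-in for an embedded image from an unseen province
retrieved = retrieve_captions(query, BANK, SEEN, k=2)
print(retrieved)
```

Filtering the pool before ranking, rather than after, is what keeps the protocol inductive: even a perfect match from an unseen province cannot leak into the retrieved context.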

In practice

Apply retrieval-augmented captioning for niche cultural datasets.
Use province-level zero-shot splits to evaluate across diverse regional variations.
Integrate multiple pre-trained encoders for robust domain adaptation.
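
A province-level split like the 24/6/8 protocol in the summary could be set up along these lines. This is a sketch under stated assumptions: it treats the three province groups as disjoint (24 + 6 + 8 = 38), picks them with a seeded shuffle, and uses placeholder province names and record fields. The actual province assignment used in the paper is not given here, so nothing below reproduces it.

```python
import random

def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition province names into disjoint train/val/test sets."""
    if n_train + n_val + n_test != len(provinces):
        raise ValueError("split sizes must sum to the number of provinces")
    shuffled = sorted(provinces)           # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    train = set(shuffled[:n_train])
    val = set(shuffled[n_train:n_train + n_val])
    test = set(shuffled[n_train + n_val:])
    return train, val, test

def assign_split(records, train, val, test):
    """Route each image record to a split according to its province."""
    groups = {"train": train, "val": val, "test": test}
    out = {name: [] for name in groups}
    for rec in records:
        for name, provs in groups.items():
            if rec["province"] in provs:
                out[name].append(rec)
                break
    return out

# Hypothetical placeholders standing in for the 38 province names and image records.
PROVINCES = [f"province_{i:02d}" for i in range(38)]
train_p, val_p, test_p = split_provinces(PROVINCES)

records = [{"image": f"img_{i}.jpg", "province": p} for i, p in enumerate(PROVINCES)]
splits = assign_split(records, train_p, val_p, test_p)
print({name: len(items) for name, items in splits.items()})
```

Splitting by province rather than by image is the key design choice: every image from a held-out province stays out of training, so test performance reflects generalization to unseen regional styles rather than to unseen photos of familiar ones.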

Topics

Zero-Shot Captioning
Cultural Heritage AI
Vision-Language Models
CLIP
Retrieval Augmentation
Indonesian Garments

Code references

AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.