Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing
Summary
Custom ZeroCLIP is a retrieval-augmented vision-language framework for zero-shot captioning of traditional Indonesian garments. It is built on a dataset of 3,800 expert-annotated images spanning all 38 Indonesian provinces and follows a province-level inductive zero-shot protocol: training on 24 seen provinces, validating on 6 more seen provinces, and evaluating on 8 unseen provinces. The architecture integrates a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM caption decoder. During inference, the model retrieves captions solely from training provinces, so no unseen-province data is used. Custom ZeroCLIP achieved a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, surpassing existing baselines. Ablation studies attribute a 19.3% METEOR gain to retrieval, which improves recovery of cultural vocabulary, and human evaluation confirmed the cultural accuracy and fluency of the generated captions. The dataset is publicly available.
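The province-level protocol described above can be sketched as a split over province names rather than over individual images. This is a minimal illustration, not the authors' code: the function names, the placeholder province labels, the random seed, and the assumption of 100 images per province (which happens to total 3,800) are all hypothetical. It also assumes the three groups are disjoint, which the 24 + 6 + 8 = 38 arithmetic suggests.

```python
import random


def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition province names into disjoint train/val/test groups."""
    provinces = sorted(set(provinces))
    if len(provinces) != n_train + n_val + n_test:
        raise ValueError("province count does not match the split sizes")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(provinces)
    train = provinces[:n_train]
    val = provinces[n_train:n_train + n_val]
    test = provinces[n_train + n_val:]
    return train, val, test


def assign_images(records, train, val, test):
    """Route each (image_id, province) record to the split that owns its province."""
    lookup = {p: "train" for p in train}
    lookup.update({p: "val" for p in val})
    lookup.update({p: "test" for p in test})
    splits = {"train": [], "val": [], "test": []}
    for image_id, province in records:
        splits[lookup[province]].append(image_id)
    return splits


# Placeholder labels; the real dataset uses the 38 Indonesian provinces.
provinces = [f"province_{i:02d}" for i in range(38)]
# Assumption: 100 images per province (the source only gives the 3,800 total).
records = [(f"{p}_img{j:03d}", p) for p in provinces for j in range(100)]

train, val, test = split_provinces(provinces)
splits = assign_images(records, train, val, test)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by province rather than by image is what makes the evaluation inductive: every image from a test province is held out, so the model never sees that region's garments during training.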
Key takeaway
For computer vision engineers building AI for cultural heritage, Custom ZeroCLIP shows that zero-shot captioning can work in a low-resource setting. Consider retrieval-augmented vision-language frameworks when captions need to recover culture-specific vocabulary accurately. Combined with an inductive zero-shot protocol that holds out entire regions, this approach helps your models generalize to unseen regional variations and makes automated image analysis more useful across diverse cultural artifacts.
Key insights
Retrieval-augmented vision-language models effectively generate culturally accurate captions for low-resource heritage data.
Principles
- Inductive zero-shot protocols enable generalization to unseen domains.
- Retrieval augmentation significantly boosts cultural vocabulary recovery.
- Combining multiple encoders enhances domain adaptation.
Method
Custom ZeroCLIP combines a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM caption decoder, using retrieval from seen-province captions during inference for unseen provinces.
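The retrieval constraint can be illustrated with a small nearest-neighbour lookup that filters the caption bank to seen provinces before ranking. This is a sketch under stated assumptions: the source does not specify the similarity measure, the number of retrieved captions, or the bank layout, so cosine similarity, `k`, the dictionary fields, and the toy vectors and captions below are all hypothetical stand-ins for real encoder embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_captions(query_vec, caption_bank, seen_provinces, k=2):
    """Return the k captions most similar to the query, drawn only from seen provinces."""
    # Filter first, so unseen-province captions can never be ranked or returned.
    candidates = [e for e in caption_bank if e["province"] in seen_provinces]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["caption"] for e in ranked[:k]]


# Toy bank with 3-d vectors; real embeddings would come from the image/text encoders.
caption_bank = [
    {"province": "seen_a", "caption": "caption a1", "vec": [1.0, 0.0, 0.0]},
    {"province": "seen_b", "caption": "caption b1", "vec": [0.8, 0.6, 0.0]},
    {"province": "unseen_x", "caption": "caption x1", "vec": [0.9, 0.1, 0.0]},
    {"province": "seen_a", "caption": "caption a2", "vec": [0.0, 0.0, 1.0]},
]
query = [0.9, 0.1, 0.0]  # stand-in for the embedding of an unseen-province image

retrieved = retrieve_captions(query, caption_bank, seen_provinces={"seen_a", "seen_b"})
print(retrieved)  # ['caption a1', 'caption b1']
```

Although "caption x1" matches the query exactly, it is excluded because its province is not in the seen set, which is the property the inference-time restriction is meant to guarantee.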
In practice
- Apply retrieval-augmented captioning for niche cultural datasets.
- Use province-level zero-shot splits to test generalization across regional variations.
- Integrate multiple pre-trained encoders for robust domain adaptation.
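The summary does not say how the outputs of the different encoders are combined. One common option, shown here purely as an assumption, is to L2-normalise each embedding and concatenate them so that no single encoder dominates by scale. The function names and toy vectors are hypothetical; real CLIP ViT-B/32 embeddings are 512-dimensional, and a BERT-base embedding would be 768-dimensional.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse_embeddings(*embeddings):
    """Concatenate unit-normalised embeddings from several encoders."""
    fused = []
    for emb in embeddings:
        fused.extend(l2_normalize(emb))
    return fused


clip_text = [3.0, 4.0]       # toy stand-in for a CLIP text embedding
bert_text = [0.0, 2.0, 0.0]  # toy stand-in for a BERT text embedding

fused = fuse_embeddings(clip_text, bert_text)
print(fused)  # [0.6, 0.8, 0.0, 1.0, 0.0]
```

Other fusion strategies, such as a learned projection or attention over the encoder outputs, are equally plausible; concatenation is simply the easiest to inspect.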
Topics
- Zero-Shot Captioning
- Cultural Heritage AI
- Vision-Language Models
- CLIP
- Retrieval Augmentation
- Indonesian Garments
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.