Unified Pix Token And Word Token Generative Language Model
Summary
A new model unifies pixel tokens and word tokens within a single generative language model. It targets a weakness common to existing multimodal models that rely on Vision Transformers (ViT) with CLIP or SigLIP: poor understanding of fine visual details, such as small text or numbers in images. The proposed architecture introduces "Pix Token Embedding," where each pixel has its own adjustable token embedding, and "Color Folding," which quantizes color values to significantly reduce computational complexity without perceptible distortion. It also features "Global Conditional Attention Approximation" and supports unsupervised image pretraining. Initial experiments with a 120-million-parameter model and 5 billion pixel tokens from the llava-cc3m-pretrain-595k dataset demonstrate feasibility and good performance, and suggest that the approach follows scaling laws as parameters and data increase.
Key takeaway
Research scientists developing multimodal generative models should consider adopting a unified pixel and word token architecture. This approach, which moves beyond ViT-based encoders, promises better understanding of visual detail and enables unsupervised image pretraining. Implementing "Color Folding" with a factor of 8 or 16 yields significant computational savings while maintaining visual fidelity. This shift could lead to more robust and scalable multimodal AI systems.
Key insights
Unifying pixel and word tokens in generative models enhances visual detail understanding and enables unsupervised image pretraining.
Principles
- Each pixel should have a true, adjustable token embedding.
- Color quantization (folding) can reduce computational load without visual loss.
- Unsupervised pretraining is crucial for robust visual understanding.
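The first principle can be pictured as a plain lookup table with one vector per pixel token id, in the same way a language model embeds words. The sketch below is a minimal illustration, not the paper's implementation: the vocabulary size, embedding width, and the `embed_pixels` helper are all hypothetical, and a real model would make the table a trainable parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 4096  # hypothetical: 16 levels per RGB channel gives 16**3 ids
EMBED_DIM = 8      # hypothetical, kept tiny for illustration

# One row per pixel token id. In a real model this table would be trained.
pix_embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))


def embed_pixels(token_ids):
    """Look up one embedding vector per pixel token id."""
    return pix_embedding[np.asarray(token_ids)]


# A 2x2 image of token ids; two pixels share the same id (17).
ids = np.array([[0, 17], [4095, 17]])
vectors = embed_pixels(ids)
```

Pixels with the same token id receive the same vector, so updating one row of the table changes how every pixel of that color is represented.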
Method
The model converts global pixel token sequences into local window batches, applies masked multi-head self-attention with rotary position embedding, and then unifies these with word token embeddings for joint attention and prediction.
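The first step, regrouping a global grid of pixel tokens into local window batches, can be sketched with a reshape and transpose. This is an assumption-laden illustration: the summary does not state the window shape or size, so the square, non-overlapping windows and the `to_window_batches` helper below are hypothetical.

```python
import numpy as np


def to_window_batches(tokens, window):
    """Split an (H, W) grid of token ids into non-overlapping square windows.

    Returns an array of shape (num_windows, window * window), where each row
    holds the tokens of one window in row-major order.
    Square, non-overlapping windows are an assumption of this sketch.
    """
    h, w = tokens.shape
    if h % window or w % window:
        raise ValueError("image height and width must be multiples of window")
    x = tokens.reshape(h // window, window, w // window, window)
    x = x.transpose(0, 2, 1, 3)  # bring the two window-index axes together
    return x.reshape(-1, window * window)


tokens = np.arange(16).reshape(4, 4)
windows = to_window_batches(tokens, window=2)
```

Each row of `windows` can then be treated as one short sequence, so attention cost grows with the window size rather than with the full image.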
In practice
- Use Color Folding with factor 8 for balanced performance and efficiency.
- Integrate unsupervised image pretraining for enhanced visual capabilities.
- Replace ViT-based vision encoders for improved detail recognition.
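To make the Color Folding recommendation concrete, here is a minimal sketch that assumes folding means integer-dividing each 8-bit channel by the factor. The summary does not spell out the exact rule or how channels map to tokens, so `fold_colors`, `unfold_colors`, and the single-id-per-RGB-pixel scheme in `pixel_token_ids` are hypothetical.

```python
import numpy as np


def fold_colors(image, factor=8):
    """Quantize 8-bit channel values by integer division (assumed rule)."""
    if 256 % factor != 0:
        raise ValueError("factor must divide 256")
    return (np.asarray(image, dtype=np.uint8) // factor).astype(np.uint8)


def unfold_colors(folded, factor=8):
    """Map each folded level back to the centre of its bin for display."""
    return (folded.astype(np.uint16) * factor + factor // 2).astype(np.uint8)


def pixel_token_ids(folded, factor=8):
    """Combine folded R, G, B levels into one id per pixel (assumed scheme)."""
    levels = 256 // factor
    f = folded.astype(np.int64)
    return (f[..., 0] * levels + f[..., 1]) * levels + f[..., 2]


image = np.array([[[255, 0, 128]]], dtype=np.uint8)  # one RGB pixel
folded = fold_colors(image, factor=8)
restored = unfold_colors(folded, factor=8)
token_ids = pixel_token_ids(folded, factor=8)
```

Under these assumptions, a factor of 8 leaves 32 levels per channel (32,768 possible pixel ids) and a factor of 16 leaves 16 levels (4,096 ids), compared with about 16.7 million for unfolded 24-bit color. The per-channel reconstruction error is at most half the factor.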
Topics
- Unified Pix Token Model
- Vision Transformer Limitations
- Pix Token Embedding
- Color Folding
- Global Conditional Attention Approximation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.