Unified Pix Token And Word Token Generative Language Model

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

A new model unifies pixel tokens and word tokens within a single generative language model. It targets a weakness of existing multimodal models that rely on Vision Transformer (ViT) encoders such as CLIP or SigLIP: limited understanding of fine visual detail, such as small text or numbers in images. The proposed architecture introduces "Pix Token Embedding," in which each pixel has its own adjustable token embedding, and "Color Folding," which quantizes color values to significantly reduce computational complexity without perceptible distortion. It also features "Global Conditional Attention Approximation" and supports unsupervised image pretraining. Initial experiments with a 120-million-parameter model and 5 billion pixel tokens from the llava-cc3m-pretrain-595k dataset demonstrate feasibility and good performance, and suggest that the approach follows scaling laws as parameters and data increase.

Key takeaway

Research scientists developing multimodal generative models should consider a unified pixel and word token architecture. By moving beyond ViT-based encoders, this approach promises better understanding of fine visual detail and enables unsupervised image pretraining. Implementing "Color Folding" with a factor of 8 or 16 can yield significant computational savings while maintaining visual fidelity. This shift could lead to more robust and scalable multimodal AI systems.
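To make the "Color Folding" factor concrete, here is a minimal sketch that assumes folding means uniform quantization of each 8-bit channel by integer division. That reading is an assumption, not the paper's confirmed definition, and the names fold_colors and unfold_colors are hypothetical. With a factor of 8, each channel drops from 256 possible values to 32.

```python
import numpy as np


def fold_colors(image: np.ndarray, factor: int) -> np.ndarray:
    """Map 8-bit channel values to bin indices in [0, 256 // factor).

    Assumption: "folding" is plain uniform quantization; the paper may differ.
    """
    if 256 % factor != 0:
        raise ValueError("factor must divide 256")
    return (image // factor).astype(np.uint8)


def unfold_colors(folded: np.ndarray, factor: int) -> np.ndarray:
    """Approximate the original 8-bit values using each bin's midpoint."""
    return (folded.astype(np.uint16) * factor + factor // 2).astype(np.uint8)


# Hypothetical 4x4 RGB image with random 8-bit values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

factor = 8
folded = fold_colors(image, factor)
restored = unfold_colors(folded, factor)

levels_per_channel = 256 // factor
max_error = int(np.abs(image.astype(np.int16) - restored.astype(np.int16)).max())
print(levels_per_channel, max_error)
```

Under this assumption, the per-channel reconstruction error is bounded by half the factor, which is why small factors such as 8 or 16 can shrink the pixel vocabulary while changing each channel value only slightly.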

Key insights

Unifying pixel and word tokens in generative models enhances visual detail understanding and enables unsupervised image pretraining.

Method

The model converts the global pixel token sequence into batches of local windows, applies masked multi-head self-attention with rotary position embedding within each window, and then unifies the resulting pixel representations with word token embeddings for joint attention and prediction.
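The windowing and masking steps can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not the paper's implementation: windows are non-overlapping 1D chunks of the sequence, attention is single-head with no learned projections, and rotary position embedding is omitted. The function names are hypothetical.

```python
import numpy as np


def to_window_batches(tokens: np.ndarray, window: int) -> np.ndarray:
    """Reshape a (seq_len, dim) sequence into (num_windows, window, dim)."""
    seq_len, dim = tokens.shape
    if seq_len % window != 0:
        raise ValueError("seq_len must be a multiple of window")
    return tokens.reshape(seq_len // window, window, dim)


def causal_self_attention(x: np.ndarray):
    """Single-head masked self-attention, applied independently per window.

    Assumption: x serves as query, key, and value (no learned projections).
    """
    _, window, dim = x.shape
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(dim)
    # Causal mask: position i may attend only to positions <= i.
    mask = np.tril(np.ones((window, window), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the last axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Hypothetical sequence of 16 pixel-token embeddings of width 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))

windows = to_window_batches(tokens, window=4)
out, weights = causal_self_attention(windows)
print(windows.shape, out.shape)
```

Batching windows this way keeps attention cost quadratic in the window size rather than in the full pixel sequence length, which is the usual motivation for local attention.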

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.