Merve Noyan: The Future of Vision in ML - HF Podcast #1
Summary
Merve Noyan of Hugging Face discusses the current state and future of computer vision. She notes that many core vision problems, such as object detection and image segmentation, are largely "solved," and that current efforts focus on optimization. She highlights the rapid evolution from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), which made transfer learning practical and drove significant advances. The introduction of models like LLaVA marked a "GPT moment" for vision, integrating images and text to create Vision-Language Models (VLMs). Noyan emphasizes that architectural improvements are now less frequent; the focus has shifted to refining existing models like Qwen and developing alignment techniques. She also predicts a future dominated by "world models" that compress and model the physical world for applications like autonomous driving and robotics, alongside agentic reasoning models running locally on devices.
Key takeaway
AI scientists and machine learning engineers building vision-based systems should recognize that, while core vision tasks are highly optimized, the frontier lies in multimodal integration and "world models." Prioritize exploring Vision-Language-Action models for robotics and agentic systems. For specific tasks, consider fine-tuning smaller, specialized models rather than defaulting to large, general VLMs; this improves efficiency and eases deployment on edge devices.
Key insights
Vision models are maturing, shifting from architectural breakthroughs to optimization and integration with language for real-world applications.
Principles
- Vision Transformers (ViTs) scale better than CNNs and transfer more readily to new tasks.
- Vision-Language Models (VLMs) like LLaVA enable multimodal understanding by connecting an image encoder to a language model.
- Open-source initiatives standardize model definitions and foster reproducible research.
Method
To train a LLaVA-like model, combine an image encoder (e.g., CLIP) and an LLM, then train a projection layer between them, followed by instruction fine-tuning with image-text pairs.
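The data flow behind this recipe can be sketched in a few lines. The following is a dependency-free toy, not the LLaVA implementation: random lists stand in for the frozen image encoder's patch embeddings and for the LLM's token embeddings, and the projection is assumed to be a single linear layer. All names and dimensions (`make_projection`, `project`, `build_llm_input`, `D_VISION`, `D_LLM`) are hypothetical, and no training step is shown; in a real setup the projection weights would be learned first, followed by instruction fine-tuning.

```python
import random


def make_projection(d_vision, d_llm, seed=0):
    """Create a toy linear projection (weights and bias) from vision space to LLM space."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_vision)] for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weights, bias


def project(patch_embeddings, projection):
    """Map each image patch embedding to a vector of the LLM's embedding size."""
    weights, bias = projection
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weights, bias)]
        for patch in patch_embeddings
    ]


def build_llm_input(patch_embeddings, text_embeddings, projection):
    """Prepend projected image tokens to the text token embeddings."""
    image_tokens = project(patch_embeddings, projection)
    return image_tokens + text_embeddings


# Toy dimensions (assumptions; real encoders and LLMs use hundreds to thousands).
D_VISION, D_LLM = 4, 6
rng = random.Random(1)

# Stand-in for the output of a frozen image encoder such as CLIP: 3 patches.
image_patches = [[rng.random() for _ in range(D_VISION)] for _ in range(3)]
# Stand-in for the LLM's embeddings of a 5-token text prompt.
text_embeddings = [[rng.random() for _ in range(D_LLM)] for _ in range(5)]

projection = make_projection(D_VISION, D_LLM)
llm_input = build_llm_input(image_patches, text_embeddings, projection)

# 3 image tokens + 5 text tokens, each of size D_LLM.
print(len(llm_input), len(llm_input[0]))  # prints: 8 6
```

The point of the sketch is that the projection is the only new component: it converts image features into vectors the LLM can consume as if they were ordinary token embeddings.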
In practice
- Use smaller, task-specific vision models instead of large VLMs for focused tasks.
- Experiment with different Transformers-based object detection models on Hugging Face.
- Start with a problem, use VLMs to understand shortcomings, then train a specialized model.
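One way to try several detection models side by side is the `object-detection` pipeline in the Transformers library. The sketch below is an assumption about how such a comparison might look, not a method from the episode: the helper names (`detect`, `filter_detections`, `summarize`, `compare_models`), the score threshold, and the image path in the usage comment are hypothetical. `facebook/detr-resnet-50` and `hustvl/yolos-tiny` are public checkpoints on the Hugging Face Hub; running them requires `transformers`, `torch`, and possibly `timm` to be installed.

```python
def detect(image, model_id):
    """Run one object detection checkpoint on an image (path, URL, or PIL image)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    detector = pipeline("object-detection", model=model_id)
    # Returns a list of dicts with "score", "label", and "box" keys.
    return detector(image)


def filter_detections(detections, min_score=0.5):
    """Keep detections at or above min_score, highest score first."""
    kept = [d for d in detections if d["score"] >= min_score]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


def summarize(detections):
    """Count detections per label."""
    counts = {}
    for d in detections:
        counts[d["label"]] = counts.get(d["label"], 0) + 1
    return counts


def compare_models(image, model_ids, min_score=0.5):
    """Return a per-model label count for the same image."""
    return {
        model_id: summarize(filter_detections(detect(image, model_id), min_score))
        for model_id in model_ids
    }


# Example usage (downloads model weights; "street.jpg" is a hypothetical file):
# compare_models("street.jpg", ["facebook/detr-resnet-50", "hustvl/yolos-tiny"])
```

Comparing label counts on a handful of representative images is a quick first check before investing in fine-tuning a specialized model.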
Topics
- Vision Language Models
- Vision Transformers
- World Models
- Hugging Face Ecosystem
- Open-Source AI
Best for: AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.