Merve Noyan: The Future of Vision in ML - HF Podcast #1

2026-03-27 · Source: HuggingFace · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Intermediate, extended

Summary

Merve Noyan of Hugging Face discusses the current state and future of computer vision. She argues that many core vision problems, such as object detection and image segmentation, are largely "solved," so current work focuses on optimization. She traces the rapid shift from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), which scale better and transfer more readily to new tasks. Models like LLaVA marked a "GPT moment" for vision by combining images and text in Vision-Language Models (VLMs). Noyan notes that architectural breakthroughs are now less frequent, and that effort has moved to refining existing model families such as Qwen and to developing alignment techniques. She predicts a future dominated by "world models" that compress and model the physical world for applications like autonomous driving and robotics, alongside agentic reasoning models that run locally on devices.

Key takeaway

AI scientists and machine learning engineers building vision-based systems should recognize that core vision tasks are already highly optimized and that the frontier now lies in multimodal integration and "world models." Prioritize exploring Vision-Language-Action models for robotics and agentic systems. For specific tasks, consider fine-tuning smaller, specialized models rather than defaulting to large, general VLMs; this improves efficiency and makes deployment on edge devices easier.

Key insights

Vision models are maturing, shifting from architectural breakthroughs to optimization and integration with language for real-world applications.

Principles

Vision Transformers (ViTs) scale better than CNNs and transfer more readily to new tasks.
Vision-Language Models (VLMs) like LLaVA enable multimodal understanding by connecting an image encoder to a language model.
Open-source initiatives standardize model definitions and foster reproducible research.

Method

To train a LLaVA-like model, combine an image encoder (e.g., CLIP) with an LLM, train a projection layer that maps the encoder's image features into the LLM's input embedding space, then instruction fine-tune the combined model on image-text pairs.
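
The recipe above can be illustrated with a minimal, framework-free sketch. It only shows the data flow (image-patch features are projected to the LLM's embedding width and placed ahead of the text tokens) and a two-stage training plan. The dimensions, function names, the Stage structure, and the choice of which components are unfrozen in the second stage are all assumptions made for illustration, not details from the podcast. A real implementation would use a deep learning framework and pretrained weights.

```python
import random
from dataclasses import dataclass

# Hypothetical toy sizes; real encoders and LLMs use far larger dimensions.
VISION_DIM = 4   # width of each image-patch feature from the image encoder
LLM_DIM = 6      # width of the LLM's token embeddings


def make_projection(in_dim, out_dim, seed=0):
    """Create a random (in_dim x out_dim) weight matrix for the projector."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(patch_features, weights):
    """Map each patch feature vector into the LLM embedding space."""
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in zip(*weights)]
        for patch in patch_features
    ]


def build_llm_input(patch_features, text_embeddings, weights):
    """Place projected image tokens ahead of the text token embeddings."""
    return project(patch_features, weights) + text_embeddings


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: tuple  # components whose weights are updated in this stage


# Assumed schedule: the summary only says the projector is trained first and
# instruction fine-tuning follows; what is unfrozen in stage two is a guess.
STAGES = (
    Stage("projector alignment", "image-caption pairs", ("projector",)),
    Stage("instruction fine-tuning", "image-grounded instructions", ("projector", "llm")),
)

if __name__ == "__main__":
    patches = [[1.0] * VISION_DIM for _ in range(3)]   # 3 image patches
    text = [[0.0] * LLM_DIM for _ in range(2)]         # 2 text tokens
    weights = make_projection(VISION_DIM, LLM_DIM)
    sequence = build_llm_input(patches, text, weights)
    print(len(sequence), len(sequence[0]))  # prints "5 6"
    for stage in STAGES:
        print(stage.name, "->", stage.trainable)
```

The point of the design is that the projector is the only new component: the image encoder and the LLM are reused, and the projector just makes image features look like token embeddings to the LLM.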

In practice

Use smaller, task-specific vision models instead of large VLMs for focused tasks.
Experiment with different Transformers-based object detection models on Hugging Face.
Start with a problem, use VLMs to understand shortcomings, then train a specialized model.

Topics

Vision Language Models
Vision Transformers
World Models
Hugging Face Ecosystem
Open-Source AI

Best for: AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.