Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
Summary
A new framework automatically assesses building conditions across the United States using multimodal Large Language Models (LLMs) and Google Street View (GSV) imagery. Fine-tuned on a modest human-labeled dataset, Gemma 3 27B aligns closely with human mean opinion scores (MOS) and outperforms individual human raters on both Spearman's rank correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC). For efficiency, knowledge distillation transfers the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model, which performs comparably at a 3x speedup, and further to a CNN-based model (EfficientNetV2-M) and a transformer model (SwinV2-B), which reach a 30x speedup. The framework also examines how well LLMs assess an extensive list of built environment and housing attributes through a human–AI alignment study, and it includes a visualization dashboard for homeowners to analyze LLM assessment outcomes.
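To make the evaluation metrics concrete, here is a minimal sketch of how SRCC and PLCC against MOS could be computed. The ratings, the 1-5 scale, the model scores, and all function names are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt
from statistics import mean


def plcc(x, y):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def mean_opinion_scores(ratings_per_image):
    """MOS: the mean of all human ratings given to each image."""
    return [mean(r) for r in ratings_per_image]


# Hypothetical data: 5 images, 3 human raters each, on an assumed 1-5 scale.
ratings = [[1, 2, 1], [2, 3, 2], [3, 3, 4], [4, 5, 4], [5, 5, 4]]
model_scores = [1.2, 2.9, 2.6, 4.1, 4.8]  # hypothetical model outputs

mos = mean_opinion_scores(ratings)
print("SRCC:", srcc(model_scores, mos))
print("PLCC:", plcc(model_scores, mos))
```

SRCC rewards getting the ordering of buildings right, while PLCC rewards a linear match to the MOS values themselves, which is why the two are typically reported together.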
Key takeaway
For Computer Vision Engineers developing automated property assessment systems, this research demonstrates that fine-tuned multimodal LLMs like Gemma 3 27B can reach human-level agreement with mean opinion scores when evaluating building conditions from street-view imagery. Consider distilling the fine-tuned model into smaller models such as EfficientNetV2-M or SwinV2-B to gain large inference speedups (up to 30x) while maintaining high correlation with human expert judgments, which makes deployment across millions of images practical and cost-effective.
Key insights
Multimodal LLMs, fine-tuned and distilled, can automate large-scale building condition assessment from street-view imagery with high accuracy and efficiency.
Principles
- Fine-tuning LLMs with limited data improves human alignment.
- Knowledge distillation enhances efficiency without significant performance loss.
- LLMs can assess diverse housing attributes beyond overall condition.
Method
The method involves preprocessing GSV images with GroundingDINO, fine-tuning Gemma 3 27B using QLoRA on human-labeled data, and then distilling its knowledge into smaller LLMs or vision models for faster inference.
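The distillation step can be illustrated with a toy sketch. It assumes response-based distillation, in which the teacher's scores on unlabeled images serve as regression targets for the student. This summary does not state the paper's actual distillation loss or training setup, so `teacher_score`, the two-number feature vectors, and the linear student are all hypothetical stand-ins (a real student would be Gemma 3 4B, EfficientNetV2-M, or SwinV2-B operating on images).

```python
import random


def teacher_score(features):
    """Hypothetical stand-in for the fine-tuned teacher model's condition score."""
    return 1.0 + 2.0 * features[0] + 1.0 * features[1]


def student_predict(weights, bias, features):
    """Toy linear student; a real student would be a vision model on pixels."""
    return bias + sum(w * f for w, f in zip(weights, features))


def distill(unlabeled, epochs=2000, lr=0.3):
    """Fit the student to the teacher's outputs with full-batch gradient descent on MSE."""
    # Step 1: the teacher pseudo-labels the unlabeled pool (no human labels needed).
    targets = [teacher_score(f) for f in unlabeled]
    # Step 2: the student regresses onto those pseudo-labels.
    weights, bias = [0.0] * len(unlabeled[0]), 0.0
    n = len(unlabeled)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for f, t in zip(unlabeled, targets):
            err = student_predict(weights, bias, f) - t
            grad_b += 2 * err / n
            for i, fi in enumerate(f):
                grad_w[i] += 2 * err * fi / n
        bias -= lr * grad_b
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return weights, bias


rng = random.Random(0)
pool = [[rng.random(), rng.random()] for _ in range(200)]  # hypothetical features
weights, bias = distill(pool)

# Compare student and teacher on held-out inputs the student never saw.
held_out = [[rng.random(), rng.random()] for _ in range(20)]
max_gap = max(abs(student_predict(weights, bias, f) - teacher_score(f)) for f in held_out)
print("largest student-teacher gap:", max_gap)
```

The point of the pattern is that the expensive teacher runs once over a large unlabeled pool, after which only the cheap student is needed at inference time.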
In practice
- Run Gemma 3 27B on a single GPU to evaluate building condition.
- Fine-tune with 500 labeled images to match human-level consistency.
- Distill knowledge to EfficientNetV2-M or SwinV2-B for 30x speedup.
Topics
- Multimodal LLMs
- Building Condition Assessment
- Google Street View Imagery
- Knowledge Distillation
- Gemma 3
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.