Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Architecture & Urban Planning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper presents a framework that automatically assesses building conditions across the United States from Google Street View (GSV) imagery using multimodal large language models (LLMs). Fine-tuning Gemma 3 27B on a modest human-labeled dataset yields strong alignment with human mean opinion scores (MOS): the model outperforms individual human raters on both Spearman's rank correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC). For efficiency, knowledge distillation transfers the 27B model's capabilities to a smaller Gemma 3 4B model, which performs comparably at a 3x speedup, and further to a CNN (EfficientNetV2-M) and a transformer (SwinV2-B), which reach a 30x speedup. The framework also examines how well LLMs assess an extensive list of built environment and housing attributes through a human–AI alignment study, and it includes a visualization dashboard for homeowners to analyze LLM assessment outcomes.
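To make the two agreement metrics concrete, here is a minimal pure-Python sketch of how SRCC and PLCC against MOS could be computed. The ratings, the model scores, and the function names (`pearson`, `average_ranks`, `spearman`) are all hypothetical illustrations, not data or code from the paper. The paper's exact evaluation protocol (for example, whether a rater is held out of the MOS before being compared with it) is not described in this summary.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical condition ratings (1 = poor, 5 = excellent):
# one row per human rater, one column per image.
ratings = [
    [1, 2, 4, 5, 3],
    [2, 2, 5, 4, 3],
    [1, 3, 4, 5, 2],
]

# Mean opinion score per image: the average over raters.
mos = [sum(column) / len(column) for column in zip(*ratings)]

# Hypothetical model predictions for the same five images.
model_scores = [1.2, 2.4, 4.5, 4.6, 2.7]

print("model  SRCC:", spearman(model_scores, mos), "PLCC:", pearson(model_scores, mos))
for idx, rater in enumerate(ratings):
    print(f"rater{idx} SRCC:", spearman(rater, mos), "PLCC:", pearson(rater, mos))
```

SRCC rewards getting the ordering of buildings right, while PLCC also rewards a linear match to the MOS scale, which is why the two are usually reported together.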

Key takeaway

For computer vision engineers building automated property assessment systems, this research shows that a fine-tuned multimodal LLM such as Gemma 3 27B can agree with human mean opinion scores more closely than individual human raters do when evaluating building conditions from street-view imagery. Consider distilling the fine-tuned model into smaller models such as EfficientNetV2-M or SwinV2-B: the reported inference speedup reaches 30x while correlation with human judgments stays high, which makes deployment across millions of images practical and cost-effective.

Key insights

Multimodal LLMs, fine-tuned and distilled, can automate large-scale building condition assessment from street-view imagery with high accuracy and efficiency.

Method

The pipeline preprocesses GSV images with GroundingDINO, fine-tunes Gemma 3 27B with QLoRA on human-labeled data, and then distills the fine-tuned model into smaller LLMs or vision models for faster inference.
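The distillation step can be illustrated with a toy sketch of response-based distillation, in which a student is trained to reproduce the teacher's output scores on unlabeled inputs. This is an assumption about the general technique, not the paper's recipe: the summary does not state the distillation loss or training setup. Here `teacher_score` is a hypothetical stand-in for the fine-tuned Gemma 3 27B, each image is reduced to a single made-up feature, and the student is a one-variable linear model rather than EfficientNetV2-M or SwinV2-B.

```python
import random


def teacher_score(feature):
    """Hypothetical stand-in for the large teacher: feature in [0, 1] -> score in [1, 5]."""
    return 1.0 + 4.0 * feature


def fit_linear_student(features, targets):
    """Closed-form least squares for a 1-D linear student: score = w * x + b."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


random.seed(0)

# Step 1: run the (expensive) teacher once over unlabeled inputs to get pseudo-labels.
unlabeled_features = [random.random() for _ in range(200)]
pseudo_labels = [teacher_score(x) for x in unlabeled_features]

# Step 2: fit the (cheap) student to the teacher's outputs instead of human labels.
w, b = fit_linear_student(unlabeled_features, pseudo_labels)


def student_score(feature):
    """Distilled student: a much cheaper model that mimics the teacher's scores."""
    return w * feature + b


print("teacher:", teacher_score(0.5), "student:", student_score(0.5))
```

The useful property is that the teacher's cost is paid once, at labeling time, and every later inference call goes through the smaller student.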

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.