Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

2026-04-24 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Architecture & Urban Planning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper presents a framework that automatically assesses building conditions across the United States using multimodal Large Language Models (LLMs) and Google Street View (GSV) imagery. Fine-tuning Gemma 3 27B on a modest human-labeled dataset yields strong alignment with human mean opinion scores (MOS): the model outperforms individual human raters on both Spearman's rank correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC). For efficiency, knowledge distillation transfers the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model, which performs comparably at a 3x speedup, and further to a CNN (EfficientNetV2-M) and a transformer (SwinV2-B), which reach a 30x speed gain. The framework also examines, through a human–AI alignment study, how well LLMs assess an extensive list of built environment and housing attributes, and it includes a visualization dashboard for homeowners to analyze the LLM assessment outcomes.
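
To make the two agreement metrics concrete, here is a minimal sketch of how SRCC and PLCC against MOS could be computed. The function names and the toy ratings are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ. In practice, scipy.stats.spearmanr and scipy.stats.pearsonr do the same job.

```python
from math import sqrt


def mean_opinion_scores(ratings_per_image):
    """MOS: average the scores that several raters gave each image."""
    return [sum(r) / len(r) for r in ratings_per_image]


def pearson(x, y):
    """PLCC: linear correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of the scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): three raters score four images on a 1-5 scale.
ratings = [[1, 2, 1], [3, 3, 4], [5, 4, 5], [2, 2, 3]]
mos = mean_opinion_scores(ratings)
model_scores = [1.5, 3.0, 4.9, 2.0]  # hypothetical model predictions

print("SRCC:", spearman(model_scores, mos))
print("PLCC:", pearson(model_scores, mos))
```

SRCC rewards getting the ordering of buildings right, while PLCC also rewards getting the spacing between scores right, which is why the two are usually reported together.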

Key takeaway

For computer vision engineers building automated property assessment systems, this research shows that a fine-tuned multimodal LLM such as Gemma 3 27B can match or exceed individual human raters when evaluating building conditions from street-view imagery. Consider distilling the fine-tuned model into a smaller one such as EfficientNetV2-M or SwinV2-B: inference runs up to 30x faster while correlation with human judgments stays high, which makes deployments over millions of images practical and cost-effective.

Key insights

Multimodal LLMs, fine-tuned and distilled, can automate large-scale building condition assessment from street-view imagery with high accuracy and efficiency.

Principles

Fine-tuning LLMs with limited data improves human alignment.
Knowledge distillation enhances efficiency without significant performance loss.
LLMs can assess diverse housing attributes beyond overall condition.

Method

The pipeline preprocesses GSV images with GroundingDINO, fine-tunes Gemma 3 27B with QLoRA on human-labeled data, and then distills the fine-tuned model into smaller LLMs or vision models for faster inference.
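
The distillation step can be illustrated with a deliberately tiny sketch. It assumes the simplest form of response-based distillation, in which the student is trained to regress the teacher's predicted condition scores; the summary does not say which distillation objective the authors use. The linear student, the one-dimensional features, and the teacher scores below are all hypothetical stand-ins for an image model such as EfficientNetV2-M and the Gemma 3 27B outputs.

```python
def predict(weights, bias, x):
    """Linear student: score = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_student(features, teacher_scores, lr=0.05, epochs=2000):
    """Train the student to match teacher scores.

    Uses full-batch gradient descent on the mean squared error between
    student predictions and the teacher's outputs (the soft targets).
    """
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(features, teacher_scores):
            err = predict(weights, bias, x) - target
            for d in range(dim):
                grad_w[d] += 2 * err * x[d] / n
            grad_b += 2 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical data: one feature per image, scored by the teacher.
# No human labels are needed here, so the teacher can label unlabeled images.
features = [[0.0], [1.0], [2.0], [3.0]]
teacher_scores = [1.0, 2.0, 3.0, 4.0]

weights, bias = fit_student(features, teacher_scores)
print("student weights:", weights, "bias:", bias)
```

The design point this is meant to show is that the student never sees human labels directly: once the large model is fine-tuned, it can score as many unlabeled street-view images as needed to train the smaller model.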

In practice

Use Gemma 3 27B for single-GPU building condition evaluation.
Fine-tune with 500 labeled images to match human-level consistency.
Distill knowledge to EfficientNetV2-M or SwinV2-B for 30x speedup.
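
A back-of-the-envelope calculation shows what these speedups mean for a large deployment. The baseline of 1.8 seconds per image is a made-up placeholder, not a figure from the paper, and the sketch assumes that both the 3x and the 30x speedups are measured relative to Gemma 3 27B.

```python
def gpu_hours(num_images, seconds_per_image, speedup=1.0):
    """Single-GPU hours needed to score num_images at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return num_images * seconds_per_image / speedup / 3600.0


# HYPOTHETICAL teacher latency; replace with a measured value.
BASELINE_SECONDS_PER_IMAGE = 1.8

# Speedups as reported in the summary, assumed relative to Gemma 3 27B.
SPEEDUPS = {
    "Gemma 3 27B (teacher)": 1.0,
    "Gemma 3 4B (distilled)": 3.0,
    "EfficientNetV2-M or SwinV2-B (distilled)": 30.0,
}

for name, speedup in SPEEDUPS.items():
    hours = gpu_hours(1_000_000, BASELINE_SECONDS_PER_IMAGE, speedup)
    print(f"{name}: {hours:.1f} GPU-hours per million images")
```

Whatever the true baseline latency is, the ratio holds: a job that takes the teacher 30 days on one GPU would take a 30x faster student about one day.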

Topics

Multimodal LLMs
Building Condition Assessment
Google Street View Imagery
Knowledge Distillation
Gemma 3

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.