Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Summary
Vision-language models (VLMs) are increasingly used to generate structured descriptions from street-level imagery for tasks such as streetscape auditing and mapping. The article argues that benchmarks for VLMs in urban perception should treat human disagreement and abstention as valid measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable in urban governance applications. The argument is grounded in a benchmark of 100 Montreal street scenes, annotated across 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. The findings indicate that model agreement with human consensus correlates with human reliability, and that for "Overall Impression" the label distributions of models and annotators do not match, including differing rates of "Not applicable."
Key takeaway
If you develop or evaluate VLMs for urban governance applications, build human disagreement and abstention into your benchmark design and reporting. Report inter-annotator reliability alongside model alignment, and treat label spaces as negotiable artifacts. Doing so guards against misleading performance claims and supports more equitable and robust outcomes from VLM deployments.
Key insights
Benchmarking VLMs for urban perception must account for human disagreement and abstention, treating them as valid measurement outcomes.
Principles
- Treat disagreement and abstention as measurement outcomes.
- Report inter-annotator reliability alongside model alignment.
- Treat label space and scoring policy as negotiable artifacts.
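As an illustration of reporting reliability next to alignment, here is a minimal Python sketch. It is not the paper's method: this summary does not state which reliability statistic the authors use, so mean pairwise percent agreement and a majority-vote consensus are assumptions, and the function names and data layout are hypothetical. Abstention ("Not applicable") is kept as an ordinary label rather than filtered out.

```python
from collections import Counter
from itertools import combinations

NA = "Not applicable"  # abstention is kept as a label, not dropped


def pairwise_agreement(labels):
    """Fraction of annotator pairs that chose the same label for one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return float("nan")  # a single annotator has no pairs to compare
    return sum(a == b for a, b in pairs) / len(pairs)


def consensus(labels):
    """Most frequent label; ties are broken alphabetically (an arbitrary choice)."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)[0]


def report(human, model):
    """human: {item_id: [label, ...]}, model: {item_id: label}.

    Returns human reliability and model alignment side by side so that
    neither number is read without the other.
    """
    n = len(human)
    reliability = sum(pairwise_agreement(v) for v in human.values()) / n
    alignment = sum(model[k] == consensus(v) for k, v in human.items()) / n
    return {"reliability": reliability, "alignment": alignment}
```

Running `report` once per annotation dimension would yield a table in which low model alignment can be read against low human reliability on the same dimension.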
Method
The paper grounds its argument in a benchmark of 100 Montreal street scenes, annotated along 30 dimensions by 12 participants, and in a deterministic zero-shot evaluation of seven VLMs.
In practice
- Annotate street scenes across multiple dimensions.
- Involve community organizations in annotation.
- Evaluate VLMs with zero-shot methods.
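For a single dimension such as "Overall Impression," the comparison of label distributions and "Not applicable" rates mentioned in the summary could be sketched as follows. This is an assumption-laden illustration: total variation distance is chosen here for simplicity and is not confirmed as the paper's measure, the function names are hypothetical, and the inputs are assumed to be non-empty flat lists of labels.

```python
from collections import Counter

NA = "Not applicable"


def distribution(labels):
    """Relative frequency of each label in a non-empty list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def compare(human_labels, model_labels):
    """Summarize distributional mismatch and abstention rates for one dimension."""
    p = distribution(human_labels)
    q = distribution(model_labels)
    return {
        "tv_distance": total_variation(p, q),
        "human_na_rate": p.get(NA, 0.0),
        "model_na_rate": q.get(NA, 0.0),
    }
```

Reporting the two abstention rates separately keeps a model that never answers "Not applicable" visible even when its overall agreement score looks acceptable.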
Topics
- Vision-Language Models
- Urban Perception
- Benchmark Design
- Inter-Annotator Reliability
- Model Evaluation
- Urban Governance
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.