Visual Agentic System for Spatial Metric Query Answering in Remote Sensing Images

Wang, Yinghao; Wang, Cheng; Günther, Tobias; Montazeri, Zahra

Date: 2025-05-09
ISBN: 978-3-03868-269-1
ISSN: 1017-4656
DOI: https://doi.org/10.2312/egp.20251028
URL: https://diglib.eg.org/handle/10.2312/egp20251028
License: Attribution 4.0 International License
CCS Concepts: Computing methodologies → Scene Understanding; Image Segmentation; Object Identification
Pages: 2

Abstract: Accurately measuring real-world object dimensions from Remote Sensing (RS) images is crucial for applications in geospatial analysis and urban planning. Traditional Vision-Language Models (VLMs) struggle with spatial reasoning, while end-to-end remote sensing VLMs are often limited to predefined tasks such as image captioning. In this paper, we propose a visual agentic system for spatial metric query answering, dynamically integrating code-generation agents with a grounded remote sensing VLM and a Vision Specialist. Our system autonomously identifies reference objects, infers scale factors, and performs spatial measurements through structured subroutines. Experiments demonstrate that our approach achieves higher accuracy in footprint area estimation compared to state-of-the-art large language models with vision capabilities.
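
The abstract describes a chain of steps: identify a reference object, infer a scale factor, then take a spatial measurement. The sketch below illustrates only the arithmetic behind such a chain; it is not the authors' implementation. The function names, the assumed 4.5 m reference length, and the pixel coordinates are all hypothetical, and the polygon area uses the standard shoelace formula as a stand-in for whatever segmentation-based measurement the system actually performs.

```python
# Illustrative only: hypothetical helpers, not the system described in the paper.

def scale_from_reference(ref_length_m, ref_length_px):
    """Metres per pixel, from a reference object of assumed real-world length."""
    if ref_length_px <= 0:
        raise ValueError("reference length in pixels must be positive")
    return ref_length_m / ref_length_px


def polygon_area_px(vertices):
    """Area of a simple polygon in square pixels (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def footprint_area_m2(vertices, metres_per_px):
    """Convert a pixel-space footprint to square metres.

    Area scales with the square of the linear scale factor.
    """
    return polygon_area_px(vertices) * metres_per_px ** 2


# Assumed example: a car taken to be 4.5 m long spans 15 px in the image.
metres_per_px = scale_from_reference(4.5, 15.0)

# Assumed building footprint: a 100 px by 50 px rectangle.
footprint = [(0, 0), (100, 0), (100, 50), (0, 50)]
area_m2 = footprint_area_m2(footprint, metres_per_px)
print(f"scale: {metres_per_px:.3f} m/px, footprint: {area_m2:.1f} m^2")
```

Because area grows with the square of the scale factor, a small error in the reference length has a doubled relative effect on the estimated footprint area, so the choice of reference object matters for any pipeline of this kind.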