3D Aware Region Prompted Vision Language Model

Anonymous Submission 669


3D‑Aware Region Prompting Multi‑View Spatial Reasoning

Architecture

The key idea behind SR-3D is a canonical positional representation shared across single-view and multi-view inputs. Because both input types use the same representation, the model can be pretrained on large-scale single-view data and the spatial priors it learns transfer to multi-view settings.
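One way to picture a positional representation shared by single-view and multi-view inputs is sketched below. This is an illustrative assumption, not the SR-3D implementation: it back-projects per-pixel depth through a pinhole camera, moves each view into a common world frame, and normalizes all points into one unit box. The function names (`backproject`, `to_world`, `normalize`, `canonical_positions`), the use of depth maps and camera poses, and the bounding-box normalization are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

# Identity rotation: a single view can be treated as its own world frame.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift an HxW depth map to camera-frame 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points: Sequence[Point],
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> List[Point]:
    """Apply a camera-to-world rigid transform: p_w = R p_c + t."""
    return [
        tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
              for i in range(3))
        for p in points
    ]


def normalize(points_per_view: Sequence[Sequence[Point]]) -> List[List[Point]]:
    """Map the points of all views into one shared [0, 1]^3 box (assumed scheme)."""
    all_points = [p for view in points_per_view for p in view]
    lo = [min(p[i] for p in all_points) for i in range(3)]
    hi = [max(p[i] for p in all_points) for i in range(3)]
    span = [max(hi[i] - lo[i], 1e-8) for i in range(3)]  # avoid divide-by-zero
    return [
        [tuple((p[i] - lo[i]) / span[i] for i in range(3)) for p in view]
        for view in points_per_view
    ]


def canonical_positions(views) -> List[List[Point]]:
    """views: list of (depth, (fx, fy, cx, cy), rotation, translation)."""
    world = [to_world(backproject(d, *k), r, t) for d, k, r, t in views]
    return normalize(world)
```

Under these assumptions a single image is just the one-view case with an identity pose, so the same code path produces positions for both single-view and multi-view inputs, which is the property the shared representation is meant to provide.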

Performance on 2D Spatial Benchmarks

Incorporating 3D positional information improves spatial understanding in single-view models: compared with its base model, NVILA-Lite-8B, SR-3D achieves higher performance on 2D spatial benchmarks.

3D Region-Level Spatial Understanding

3D Scene Benchmarks