3D Aware Region Prompted Vision Language Model

Anonymous Submission 669

3D‑Aware Region Prompting | Multi‑View Spatial Reasoning

Architecture

A key idea of SR-3D is the introduction of a canonical positional representation shared across single-view and multi-view inputs. This unified representation enables large-scale single-view pretraining and supports the transfer of learned spatial priors to multi-view settings.
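To make the idea of a shared canonical positional representation concrete, here is a minimal sketch. It is an illustration under stated assumptions, not SR-3D's actual implementation: the page does not say how positions are computed, so the sketch assumes per-pixel depth, pinhole intrinsics, and a camera-to-canonical pose are available. All function names (`backproject`, `to_canonical`, `canonical_positions`) are hypothetical. A single view is treated as the special case where the pose is the identity, so both input types end up in the same coordinate frame.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

# Assumption: a single-view input uses its own camera frame as the canonical frame.
IDENTITY_POSE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_canonical(point: Point, cam_to_canonical: Sequence[Sequence[float]]) -> Point:
    """Apply a 4x4 camera-to-canonical rigid transform to a 3D point."""
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_canonical[:3]
    )


def canonical_positions(pixels: Sequence[Tuple[float, float, float]],
                        intrinsics: Tuple[float, float, float, float],
                        cam_to_canonical: Sequence[Sequence[float]] = IDENTITY_POSE
                        ) -> List[Point]:
    """Map (u, v, depth) samples from one view to canonical 3D positions."""
    fx, fy, cx, cy = intrinsics
    return [
        to_canonical(backproject(u, v, d, fx, fy, cx, cy), cam_to_canonical)
        for (u, v, d) in pixels
    ]


# Hypothetical example: two cameras observe the same 3D point.
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (made-up values)

# View 1 defines the canonical frame; the point projects to the image center.
view1 = canonical_positions([(50.0, 50.0, 2.0)], intrinsics)

# View 2 sits one unit along +x in the canonical frame and sees the point at u = 0.
pose2 = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
view2 = canonical_positions([(0.0, 50.0, 2.0)], intrinsics, pose2)

print(view1, view2)  # both views yield [(0.0, 0.0, 2.0)]
```

Because both views map the same physical point to the same canonical coordinates, a positional encoding built on those coordinates would look identical whether the input is one image or several. This is the property that lets spatial priors learned from single-view data carry over to multi-view inputs.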

SR-3D model architecture diagram
❄️ Frozen and 🔥 Trainable parameters.

Performance on 2D Spatial Benchmarks

Incorporating 3D positional information improves spatial understanding in single-view models: compared with its base model, NVILA-Lite-8B, SR-3D scores higher on 2D spatial benchmarks.

3D Region-Level Spatial Understanding

We show challenging cases in which the same region prompts are reused across samples but refer to different target objects. SR-3D answers all of these queries correctly, which is strong evidence that it understands 3D spatial relationships.

3D Scene Benchmarks

Results on VSI-Bench. SR-3D answers spatial questions correctly even without region prompts.