Spatially-aware Weights Tokenization for NeRF-Language Models

University of Bologna, Italy

NeurIPS 2025

MLLMs operating on explicit 3D representations divide the input into localized regions to extract a spatially-aware representation. However, this approach is not directly applicable to NeRFs, whose content is encoded in the weights of a neural network.

How to get a spatially-aware representation from the weights of a NeRF?

Abstract

Neural Radiance Fields (NeRFs) are neural networks — typically multilayer perceptrons (MLPs) — that represent objects' geometry and appearance, with applications in vision, graphics, and robotics. Recent works propose understanding NeRFs with natural language using Multimodal Large Language Models (MLLMs) that directly process the weights of a NeRF's MLP. However, these approaches rely on a global representation of the input object, making them unsuitable for spatial reasoning and capturing fine-grained details.

Contrarily, we propose weights2space, a self-supervised framework featuring a novel meta-encoder that can compute a sequence of spatial tokens directly from the weights of a NeRF. Leveraging this representation, we build Spatial LLaNA, a novel MLLM for NeRFs, capable of understanding details and spatial relationships in objects represented as NeRFs.

We evaluate Spatial LLaNA on NeRF captioning and NeRF Q&A tasks, using both existing benchmarks and our novel Spatial ObjaNeRF dataset consisting of 100 manually-curated language annotations for NeRFs. The latter includes 3D models and descriptions that highlight the spatial reasoning capability of MLLMs. Spatial LLaNA outperforms existing approaches across all tasks.

weights2space: Spatially-aware Tokenization

We introduce weights2space, a self-supervised framework that transforms the weights of a NeRF into a sequence of spatially-aware embeddings. Unlike previous methods that pool weights into a single global vector, our meta-encoder employs a weights2seq module to process the weights into a sequence of tokens, followed by a seq2space Transformer decoder in which a fixed set of learnable queries attends to these tokens.

These tokens are reshaped into a tri-plane representation to enforce spatial structure. The system is trained end-to-end using a decoder that renders images from these tri-planes, allowing the meta-encoder to learn localized, spatially-grounded representations of the input NeRF without human supervision.
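To make the token flow concrete, here is a minimal numpy sketch of the shapes involved. All sizes (64 weight tokens, dimension 128, tri-plane resolution 8) are illustrative assumptions, not values from the paper, and the single cross-attention step stands in for the full seq2space Transformer decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: the NeRF MLP weights are grouped by a
# weights2seq-like step into 64 tokens of dimension 128.
d = 128
weight_tokens = rng.normal(size=(64, d))

# seq2space sketch: 3 * R * R learnable queries cross-attend
# to the weight tokens (R = assumed tri-plane resolution).
R = 8
queries = rng.normal(size=(3 * R * R, d))        # learnable query embeddings

attn = softmax(queries @ weight_tokens.T / np.sqrt(d))  # (192, 64)
spatial_tokens = attn @ weight_tokens                   # (192, 128)

# Reshape the token sequence into a tri-plane: three R x R feature
# grids (e.g. XY, XZ, YZ planes), each cell holding a d-dim feature.
tri_plane = spatial_tokens.reshape(3, R, R, d)
print(tri_plane.shape)  # (3, 8, 8, 128)
```

In the full framework, a rendering decoder is trained on these tri-planes, so the loss on rendered images is what forces each grid cell to capture a localized region of the object.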

weights2space overview

Spatial LLaNA Architecture

Spatial LLaNA (S-LLaNA) is the first MLLM for NeRFs that leverages spatially-aware tokens. It takes the tokens computed by the frozen weights2space meta-encoder and projects them into the embedding space of a Large Language Model (LLaMA 2) using a trainable projector network Ψ.

The model is trained in two stages: first optimizing the projector for alignment, and then fine-tuning the LLM for complex reasoning tasks such as captioning and Q&A. This architecture allows S-LLaNA to reason about fine-grained object details and spatial relationships significantly better than global representation methods.
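The projection step can be sketched as follows. The dimensions are assumptions for illustration (192 spatial tokens of dimension 128; 4096 is the LLaMA-2 hidden size), and a single linear map stands in for the trainable projector Ψ, whose exact architecture is not specified here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed dimensions: spatial tokens from the frozen meta-encoder,
# projected into the LLM embedding space (4096 = LLaMA-2 hidden size).
n_tokens, d_in, d_llm = 192, 128, 4096

spatial_tokens = rng.normal(size=(n_tokens, d_in))

# A linear projector Psi, sketched with random weights; the real
# projector is trained (stage 1) before the LLM is fine-tuned (stage 2).
W = rng.normal(size=(d_in, d_llm)) * 0.02
b = np.zeros(d_llm)

def psi(x):
    return x @ W + b

nerf_embeds = psi(spatial_tokens)   # NeRF tokens in LLM embedding space
print(nerf_embeds.shape)            # (192, 4096)
```

The projected tokens would then be concatenated with the text-prompt embeddings, so the LLM attends to NeRF content and language tokens in a single sequence.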

Spatial LLaNA architecture

Spatial ObjaNeRF Dataset

To rigorously evaluate the ability of NeRF-Language models to move beyond simple object recognition and capture complex spatial relationships, we introduce Spatial ObjaNeRF, a manually curated benchmark.

This dataset consists of 100 human-annotated 3D models selected from ObjaNeRF-Text. Crucially, to fully assess the spatial reasoning capability of Multimodal Large Language Models (MLLMs), every data sample represents a complex scene featuring an arrangement of multiple interacting objects. For each scene, we provide detailed textual descriptions emphasizing spatial structure, highlighting the size, shape, and relative positioning of specific objects.

Results

Related Works

Other recent works have explored the use of LLMs to reason about the 3D world.

LLaNA is the first Multimodal Large Language Model for NeRFs, using a global representation computed from the weights of the input NeRF. PointLLM and GPT4Point achieve 3D-language understanding by taking colored point clouds as input. LLM-Grounder proposes a method for open-vocabulary 3D visual grounding based on OpenScene and LERF, leveraging multi-view images and point clouds. In contrast, LLaNA considers NeRFs as its only input modality.

BibTeX

@InProceedings{Amaduzzi_NeurIPS_2025,
  author    = {Amaduzzi, Andrea and Zama Ramirez, Pierluigi and Lisanti, Giuseppe and Salti, Samuele and Di Stefano, Luigi},
  title     = {Spatially-aware Weights Tokenization for NeRF-Language Models},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}