NAF

Loick Chambon Paul Couairon Eloi Zablocki Alexandre Boulch Nicolas Thome Matthieu Cord

arXiv 2025

Abstract

Vision Foundation Models (VFMs) produce spatially downsampled features, which makes them difficult to use for pixel-level tasks. Existing upsampling methods either rely on fixed classical filters (bilinear, bicubic, joint bilateral) or on learned upsamplers that must be retrained for each VFM (FeatUp, LiFT, JAFAR). We introduce NAF (Neighborhood Attention Filtering), a zero-shot, VFM-agnostic upsampler that leverages Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE) to learn adaptive spatial-and-content weights guided solely by the high-resolution input image. NAF scales efficiently to 2K feature maps (~18 FPS) and consistently outperforms previous methods across multiple downstream tasks, including semantic segmentation, depth estimation, zero-shot open-vocabulary segmentation, and video propagation. It also demonstrates strong performance for image restoration.
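To make the filtering idea concrete, here is a minimal NumPy sketch of image-guided neighborhood attention upsampling. It is an illustration under stated assumptions, not the authors' implementation: the function name `neighborhood_attention_upsample` and the parameters `kernel`, `tau`, and `sigma` are hypothetical, raw pixel values stand in for learned query/key encodings, and a plain Gaussian distance penalty stands in for RoPE. With these hand-set weights the sketch is close in spirit to a joint bilateral filter; what the abstract describes is learning such spatial-and-content weights instead of fixing them.

```python
import numpy as np

def neighborhood_attention_upsample(guide, feats, kernel=3, tau=0.1, sigma=1.0):
    """Upsample low-res features with weights computed from the guide image.

    guide: (H, W, C) float image, feats: (h, w, D) low-res features,
    with H = s*h and W = s*w for one integer factor s.
    All names and defaults are assumptions made for this sketch.
    """
    H, W, C = guide.shape
    h, w, D = feats.shape
    s = H // h
    assert H == s * h and W == s * w, "guide size must be s times feats size"

    # Keys: the guide average-pooled down to the feature resolution.
    keys = guide.reshape(h, s, w, s, C).mean(axis=(1, 3))  # (h, w, C)

    r = kernel // 2
    # Replicate borders so every neighborhood has kernel*kernel entries.
    keys_p = np.pad(keys, ((r, r), (r, r), (0, 0)), mode="edge")
    vals_p = np.pad(feats, ((r, r), (r, r), (0, 0)), mode="edge")

    # Low-res cell containing each high-res pixel, and the pixel centre
    # expressed in low-res coordinates.
    ys, xs = np.arange(H) // s, np.arange(W) // s
    py = (np.arange(H) + 0.5) / s - 0.5
    px = (np.arange(W) + 0.5) / s - 0.5

    logits, values = [], []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            k = keys_p[ys + r + dy][:, xs + r + dx]  # (H, W, C)
            v = vals_p[ys + r + dy][:, xs + r + dx]  # (H, W, D)
            # Content term: similarity between the pixel and the pooled key.
            content = -np.sum((guide - k) ** 2, axis=-1) / tau
            # Spatial term: Gaussian penalty on distance (stand-in for RoPE).
            dist2 = ((ys + dy - py) ** 2)[:, None] + ((xs + dx - px) ** 2)[None, :]
            logits.append(content - dist2 / (2.0 * sigma ** 2))
            values.append(v)

    logits = np.stack(logits, axis=0)              # (K, H, W)
    logits -= logits.max(axis=0, keepdims=True)    # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    values = np.stack(values, axis=0)              # (K, H, W, D)
    # Each output feature is a convex combination of nearby low-res features.
    return np.einsum("khw,khwd->hwd", weights, values)
```

Because the weights depend only on the guide image and on pixel positions, the same function applies unchanged to features of any dimensionality `D`, which is the sense in which such a filter is agnostic to the model that produced the features.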

NAF enables zero-shot feature upsampling for any Vision Foundation Model, producing high-resolution features without retraining.

Results

NAF enables efficient zero-shot upsampling of any VFM's features to high resolution. It achieves state-of-the-art performance across multiple downstream tasks while remaining computationally efficient.

Summary of NAF performance across downstream tasks: semantic segmentation, depth estimation, open-vocabulary segmentation, and video propagation.

BibTeX

@misc{chambon2025nafzeroshotfeatureupsampling,
      title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering}, 
      author={Loick Chambon and Paul Couairon and Eloi Zablocki and Alexandre Boulch and Nicolas Thome and Matthieu Cord},
      year={2025},
      url={https://arxiv.org/abs/2511.18452}, 
}