NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord

arXiv 2025



Abstract

Vision Foundation Models (VFMs) produce downsampled spatial features, which are ill-suited to pixel-level tasks. Existing upsampling methods either rely on fixed classical filters (bilinear, bicubic, joint bilateral) or require training a dedicated upsampler for each VFM (FeatUp, LiFT, JAFAR). We introduce NAF (Neighborhood Attention Filtering), a zero-shot, VFM-agnostic upsampler that combines Cross-Scale Neighborhood Attention with Rotary Position Embeddings (RoPE) to compute adaptive spatial-and-content weights guided solely by the high-resolution input image. NAF scales efficiently to 2K-resolution feature maps (~18 FPS) and consistently outperforms previous methods across multiple downstream tasks, including semantic segmentation, depth estimation, zero-shot open-vocabulary segmentation, and video propagation. It also demonstrates strong performance on image restoration.
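The core mechanism lends itself to a short sketch. Below is a minimal, self-contained PyTorch illustration of cross-scale neighborhood attention as described in the abstract: queries are computed from the high-resolution guidance image, keys from the low-resolution VFM features, and each high-resolution position attends to a k x k neighborhood of low-resolution tokens around its projected location, so the upsampled feature is an attention-weighted combination of nearby low-resolution features. The module name, projections, and nearest-cell mapping are illustrative choices, not NAF's released implementation, and the RoPE relative-position term is omitted for brevity.

    # Illustrative sketch of cross-scale neighborhood attention for feature
    # upsampling. Names and design details are assumptions, not NAF's code.
    import torch
    import torch.nn.functional as F
    from torch import nn

    class CrossScaleNeighborhoodAttention(nn.Module):
        def __init__(self, feat_dim, guide_dim, head_dim=64, kernel_size=3):
            super().__init__()
            self.k = kernel_size
            self.scale = head_dim ** -0.5
            self.to_q = nn.Linear(guide_dim, head_dim)  # queries from guidance image
            self.to_k = nn.Linear(feat_dim, head_dim)   # keys from low-res features
            # Values are the low-res features themselves: the module filters
            # (re-weights) features rather than transforming them.

        def forward(self, lr_feats, hr_guide):
            # lr_feats: (B, C, h, w) low-resolution VFM features
            # hr_guide: (B, G, H, W) high-resolution guidance (e.g. the image)
            B, C, h, w = lr_feats.shape
            _, G, H, W = hr_guide.shape

            q = self.to_q(hr_guide.permute(0, 2, 3, 1))   # (B, H, W, D)
            k = self.to_k(lr_feats.permute(0, 2, 3, 1))   # (B, h, w, D)
            D = k.shape[-1]

            # Gather a k x k neighborhood of low-res keys/values around every
            # low-res cell via unfold (stride 1, zero padding keeps h x w cells).
            pad = self.k // 2
            k_patches = F.unfold(k.permute(0, 3, 1, 2), self.k, padding=pad)
            v_patches = F.unfold(lr_feats, self.k, padding=pad)
            k_patches = k_patches.view(B, D, self.k * self.k, h * w)
            v_patches = v_patches.view(B, C, self.k * self.k, h * w)

            # Map each high-res pixel to its low-res cell (nearest mapping),
            # then fetch that cell's neighborhood of keys/values.
            ys = (torch.arange(H, device=q.device) * h // H).clamp(max=h - 1)
            xs = (torch.arange(W, device=q.device) * w // W).clamp(max=w - 1)
            idx = (ys[:, None] * w + xs[None, :]).reshape(-1)   # (H*W,)
            k_n = k_patches[..., idx]   # (B, D, k*k, H*W)
            v_n = v_patches[..., idx]   # (B, C, k*k, H*W)

            # Content-adaptive attention over each pixel's neighborhood.
            q = q.reshape(B, H * W, D)
            attn = torch.einsum('bnd,bdkn->bnk', q, k_n) * self.scale
            attn = attn.softmax(dim=-1)                          # (B, H*W, k*k)

            out = torch.einsum('bnk,bckn->bnc', attn, v_n)       # (B, H*W, C)
            return out.permute(0, 2, 1).reshape(B, C, H, W)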


NAF enables zero-shot feature upsampling for any Vision Foundation Model, producing high-resolution features without retraining.

Results

NAF enables efficient zero-shot upsampling of any VFM's features to high resolution. It achieves state-of-the-art performance across multiple downstream tasks while remaining computationally efficient.
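As a hypothetical usage example, continuing from the illustrative sketch above (this is not NAF's released API), upsampling coarse ViT-style features back to the image grid would look like:

    # Hypothetical usage of the sketch above; all shapes are illustrative.
    import torch

    upsampler = CrossScaleNeighborhoodAttention(feat_dim=384, guide_dim=3)
    image = torch.rand(1, 3, 112, 112)       # high-resolution guidance image
    lr_feats = torch.rand(1, 384, 8, 8)      # stand-in for low-res VFM features
    hr_feats = upsampler(lr_feats, image)    # -> (1, 384, 112, 112)

Note that this naive unfold-and-gather materializes every neighborhood in memory; an efficient implementation would instead rely on fused neighborhood-attention kernels (e.g. NATTEN-style) to reach the reported throughput.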

Summary of NAF performance across downstream tasks: semantic segmentation, depth estimation, open-vocabulary segmentation, and video propagation.

BibTeX

@misc{chambon2025nafzeroshotfeatureupsampling,
      title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering},
      author={Loick Chambon and Paul Couairon and Eloi Zablocki and Alexandre Boulch and Nicolas Thome and Matthieu Cord},
      year={2025},
      eprint={2511.18452},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2511.18452},
}