# INSID3: Training-Free In-Context Segmentation with DINOv3


**[Claudia Cuttano](https://scholar.google.com/citations?user=W7lNKNsAAAAJ)<sup>1,2</sup> · [Gabriele Trivigno](https://scholar.google.com/citations?user=JXf_iToAAAAJ)<sup>1</sup> · [Christoph Reich](https://christophreich1996.github.io)<sup>2,3,5,6</sup> · [Daniel Cremers](https://scholar.google.com/citations?user=cXQciMEAAAAJ&hl=en)<sup>3,5,6</sup> · [Carlo Masone](https://scholar.google.com/citations?user=cM3Iz_4AAAAJ)<sup>1</sup> · [Stefan Roth](https://scholar.google.com/citations?user=0yDoR0AAAAAJ&hl=en)<sup>2,4,5</sup>**

<sup>1</sup> Politecnico di Torino    <sup>2</sup> TU Darmstadt    <sup>3</sup> TU Munich    <sup>4</sup> hessian.AI    <sup>5</sup> ELIZA    <sup>6</sup> MCML

✨ **CVPR 2026 ORAL** ✨

INSID3 solves in-context segmentation entirely within a single frozen DINOv3 backbone:

- 🚀 **Training-free:** no fine-tuning, no segmentation decoder, no auxiliary models
- 🔍 **Insight:** we uncover and fix a positional bias in DINOv3 features, improving their reliability beyond segmentation
- 📈 **State-of-the-art, smaller & faster:** outperforms both training-free and specialized methods while using a single backbone
- 🌍 **Generalizes broadly:** from object-level to part-level and personalized segmentation, across natural, medical, underwater, and aerial domains

## ⚙️ Environment Setup

INSID3 can be set up either with **Conda** or with **uv**. Choose one of the following options.

### Option 1: Conda

To get started, create a Conda environment and install the required dependencies. The experiments in the paper were run with **PyTorch 2.7.1 (CUDA 12.6)**, which we provide as a reference configuration.

To set up the environment using Conda, run:

```bash
conda create --name insid3 python=3.10 -y
conda activate insid3
pip install -r requirements.txt
```

**Optional:** If you want to use CRF-based mask refinement, also install:

```bash
git clone https://github.com/netw0rkf10w/CRF.git
cd CRF
python setup.py install
cd ..
```

### Option 2: uv

As an alternative to Conda, you can use [`uv`](https://docs.astral.sh/uv/#highlights), a fast Python package and environment manager. In this setup, the optional CRF dependency is already included.

On macOS and Linux:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Then run:

```bash
# Ensure CUDA 12.6 is loaded beforehand
# This will automatically create a virtual environment (.venv) and install dependencies from pyproject.toml
uv sync
source .venv/bin/activate
```

## 🧱 DINOv3 Weights

INSID3 relies on a **frozen DINOv3 backbone**. Please download the pretrained weights from the official repository:

👉 https://github.com/facebookresearch/dinov3

Create the `pretrain` directory:

```bash
mkdir -p pretrain
```

Place the weights of the backbone you want to use in the `pretrain/` folder:

```
pretrain/dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth
pretrain/dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
pretrain/dinov3_vits16_pretrain_lvd1689m-08c60483.pth
```

By default, we use the Large model (`dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth`).

## 📍 Minimal Usage

Here is a minimal example to segment a target image given a reference image and its mask.

```python
from models import build_insid3
from utils.visualization import visualize_prediction

ref_image_path, ref_mask_path = "assets/ref_cat_image.jpg", "assets/ref_cat_mask.png"
target_image_path = "assets/target_cat_image.jpg"
output_path = "target_cat_pred.png"

# Build model
model = build_insid3()

# Set reference and target
model.set_reference(ref_image_path, ref_mask_path)
model.set_target(target_image_path)

# Predict
pred_mask = model.segment()

# Save visualization
visualize_prediction(
    ref_image_path,
    ref_mask_path,
    target_image_path,
    pred_mask,
    output_path,
)
```

To refine the predicted mask with CRF, initialize the model with: `model = build_insid3(mask_refiner="crf")`.

## 📦 Data

Please refer to [docs/data.md](docs/data.md) for dataset preparation instructions.

## 🚀 Inference

Evaluate INSID3:

```bash
python inference.py --dataset coco --exp-name insid3-coco
```

#### Main arguments:

- `--dataset`: supported [`coco`, `lvis`, `pascal_part`, `paco_part`, `isaid`, `isic`, `lung`, `suim`, `permis`]
- `--model_size`: DINOv3 backbone size (`small`, `base`, `large`, default: `large`)
- `--shots`: number of reference images per episode (e.g., 1-shot, 5-shot, default: 1)
- Other args: hyperparameters (e.g., `--tau`, `--merge-thresh`, `--svd-comps`) have default values as in the paper; pass them to override the defaults. See `opts.py`.

**Note:** By default, the predicted mask is upsampled to the original image resolution using **bilinear interpolation**. For additional refinement, enable **CRF-based refinement** with `--crf-mask-refinement`.

## 💡 Why INSID3 Works

INSID3 builds on two key observations about DINOv3 features.

(i) **Dense DINOv3 features** naturally induce a **structured decomposition of the scene**. By clustering them, we obtain coherent object- and part-level regions without supervision.

Dense DINOv3 features
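To make observation (i) concrete, here is a minimal sketch of clustering a grid of patch features into regions. It is not the INSID3 implementation: the README does not say which clustering algorithm is used, so plain k-means with a farthest-point initialization stands in, and the function name `kmeans_patches` and the synthetic features are assumptions made for illustration only.

```python
import numpy as np

def kmeans_patches(features, k, iters=20):
    """Cluster (N, D) patch features into k groups with plain k-means.

    Hypothetical helper, not part of the INSID3 codebase.
    Returns an (N,) array of cluster labels.
    """
    # L2-normalize so Euclidean distance tracks cosine similarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Farthest-point initialization: start from patch 0, then repeatedly
    # add the patch farthest from all centroids chosen so far.
    centroids = [feats[0]]
    for _ in range(k - 1):
        dist = np.min([((feats - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(feats[dist.argmax()])
    centroids = np.stack(centroids)
    for _ in range(iters):
        # Assign each patch to its nearest centroid.
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = feats[labels == c].mean(axis=0)
    return labels

# Synthetic stand-in for a 4x4 grid of 8-dim patch features:
# the top two rows share one prototype, the bottom two rows another.
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
proto_a, proto_b = np.eye(d)[0], np.eye(d)[1]
grid = np.stack([proto_a if i < h * w // 2 else proto_b for i in range(h * w)])
grid = grid + 0.01 * rng.normal(size=grid.shape)

label_map = kmeans_patches(grid, k=2).reshape(h, w)
print(label_map)
```

With real features, `grid` would be the flattened dense feature map of the backbone and `label_map` a coarse region map at patch resolution.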

(ii) Besides semantic matches, DINOv3 also **responds to absolute image position**. Given a patch on the bird's tail in the reference image, the DINOv3 similarity map activates not only on the tail in the target image, but also over the left portion of the target image, regardless of what appears there.
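The similarity map mentioned above can be sketched as the cosine similarity between one reference patch feature and every target patch feature. The snippet below is illustrative only: `similarity_map` is a hypothetical helper, and the 4-dimensional toy features (two "semantic" and two "positional" dimensions) are a hand-built assumption meant to mimic the described effect, not real DINOv3 outputs.

```python
import numpy as np

def similarity_map(ref_feat, target_feats, grid_hw):
    """Cosine similarity between one reference patch and every target patch.

    ref_feat: (D,), target_feats: (H*W, D); returns an (H, W) map.
    Hypothetical helper, not part of the INSID3 codebase.
    """
    ref = ref_feat / np.linalg.norm(ref_feat)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    return (tgt @ ref).reshape(grid_hw)

# Toy features: dims 0-1 are "semantic" (tail, background),
# dims 2-3 are "positional" (left half, right half). Purely illustrative.
tail, background, left, right = np.eye(4)
ref_feat = 2 * tail + left  # tail patch located on the left of the reference

h, w = 2, 4
target = np.stack([
    2 * background + (left if col < w // 2 else right)
    for _ in range(h) for col in range(w)
])
target[3] = 2 * tail + right  # tail patch at row 0, col 3 of the target

sim = similarity_map(ref_feat, target, (h, w))
print(np.round(sim, 2))
# tail patch: 0.8, left-half background: 0.2, right-half background: 0.0
```

The left-half background patches score above the right-half ones even though they share no semantics with the tail, which is the kind of coordinate-driven response the text describes.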


PCA on low-semantic-content images reveals that this effect lives in a stable **low-dimensional subspace**. INSID3 removes it in a training-free way: we identify the positional component of DINOv3 features and project onto its **orthogonal complement**. This suppresses coordinate-driven responses while preserving semantics.
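The projection step can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the repository's code: the helper names `positional_basis` and `remove_positional`, the mean-centering before the SVD, the choice of one component, and the synthetic features are all illustrative. In the actual code the number of components is presumably what `--svd-comps` controls, but that is an inference from the flag name.

```python
import numpy as np

def positional_basis(blank_feats, num_comps):
    """Top principal directions of (N, D) features from low-semantic images.

    Hypothetical helper; centering and component count are assumptions.
    """
    centered = blank_feats - blank_feats.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_comps]  # (num_comps, D)

def remove_positional(feats, basis):
    """Project (N, D) features onto the orthogonal complement of `basis`."""
    return feats - (feats @ basis.T) @ basis

# Synthetic stand-in: dimension 5 carries a coordinate-driven signal.
rng = np.random.default_rng(0)
dim, n = 6, 50
pos_dir = np.eye(dim)[5]
x = np.linspace(-1.0, 1.0, n)  # e.g. normalized horizontal patch coordinate

# "Low-semantic" features: almost nothing but the positional signal.
blank = x[:, None] * pos_dir + 0.01 * rng.normal(size=(n, dim))
basis = positional_basis(blank, num_comps=1)

# Features with semantic content (dims 0-4) plus the positional signal.
semantic = np.zeros((n, dim))
semantic[:, :5] = rng.normal(size=(n, 5))
feats = semantic + x[:, None] * pos_dir
cleaned = remove_positional(feats, basis)

print(np.abs(cleaned @ basis.T).max())  # numerically zero after projection
```

Because the basis rows are orthonormal, the projection removes exactly the component inside the estimated subspace and leaves everything orthogonal to it unchanged, so no training or fine-tuning is involved.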


## Citation

If you find this work useful in your research, please cite:

```bibtex
@inproceedings{cuttano2026insid3,
  title     = {{INSID3}: Training-Free In-Context Segmentation with {DINOv3}},
  author    = {Claudia Cuttano and Gabriele Trivigno and Christoph Reich and Daniel Cremers and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

## Acknowledgements

We gratefully acknowledge the contributions of the following open-source projects:

- [DINOv3](https://github.com/facebookresearch/dinov3)
- [Matcher](https://github.com/aim-uofa/Matcher)