Dynamic Decision Learning:
Test-Time Evolution for Abnormality Grounding in Rare Diseases

Jun Li1,2, Mingxuan Liu3, Jiazhen Pan1,2, Che Liu4,
Wenjia Bai4, Cosmin I. Bercea*1,2, Julia A. Schnabel*1,2,5,6

*Shared senior authors.
1 Technical University of Munich 2 Munich Center for Machine Learning 3 University of Trento
4 Imperial College London 5 Helmholtz Munich 6 King's College London

Overview

Large vision-language models (LVLMs) struggle with clinical abnormality grounding on rare diseases and long-tailed distributions. Severe data scarcity and distribution shifts make fine-tuning impractical, while single-pass inference is highly sensitive to prompt phrasing and visual perturbations. We propose Dynamic Decision Learning (DDL), a novel test-time framework that optimizes instruction prompts and consolidates predictions across augmented views using structured matching. Across two brain imaging benchmarks with 281 rare pathologies, DDL yields up to 105% improvement in localization and consistently outperforms other methods.
Teaser

Top: Static inference with frozen LVLMs suffers from fragility and hallucinations on rare diseases, exhibiting high sensitivity to prompt and visual variations.

Bottom: DDL enables test-time dynamic evolution by synergizing the Language Space and Visual Space to navigate the decision landscape toward success, achieving up to a 105% improvement in localization reliability.

Method

Overview of the DDL framework. DDL enables test-time dynamic evolution by synergizing the Language Space and Visual Space to navigate the decision landscape toward success. It optimizes instruction prompts and consolidates predictions across multiple augmented views using structured matching, achieving significant improvements in localization reliability without any additional training.
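The consolidation step can be illustrated with a small sketch. It is not the authors' implementation: the page does not specify the matching procedure, so this example substitutes a simple greedy IoU matching with vote counting. The names `consolidate`, `iou_thr`, and `min_votes` are hypothetical, and the boxes from each augmented view are assumed to be already mapped back to the original image coordinates.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in original image coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_box(boxes: List[Box]) -> Box:
    """Coordinate-wise average of a group of boxes."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(4))  # type: ignore[return-value]


def consolidate(views: List[List[Box]], iou_thr: float = 0.5, min_votes: int = 2) -> List[Box]:
    """Greedily group boxes across views and keep groups seen in enough views.

    Hypothetical stand-in for structured matching: each box joins the
    best-overlapping existing group (at most one box per view per group),
    otherwise it starts a new group.
    """
    clusters: List[List[Box]] = []
    for boxes in views:
        used = set()  # groups that already received a box from this view
        for box in boxes:
            best_idx, best_iou = None, iou_thr
            for idx, cluster in enumerate(clusters):
                if idx in used:
                    continue
                score = iou(box, mean_box(cluster))
                if score >= best_iou:
                    best_idx, best_iou = idx, score
            if best_idx is None:
                clusters.append([box])
                used.add(len(clusters) - 1)
            else:
                clusters[best_idx].append(box)
                used.add(best_idx)
    # Boxes supported by too few views are treated as unstable and dropped.
    return [mean_box(c) for c in clusters if len(c) >= min_votes]


# Three augmented views: a consistent box in all of them, plus one stray box.
views = [
    [(10.0, 10.0, 50.0, 50.0)],
    [(12.0, 12.0, 52.0, 52.0)],
    [(11.0, 9.0, 49.0, 51.0), (200.0, 200.0, 220.0, 220.0)],
]
result = consolidate(views)
```

In this toy input the box that recurs across all three views is kept as an averaged box, while the box that appears in only one view is discarded. This is the intuition behind consolidating predictions across views: agreement under augmentation is used as a proxy for reliability.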

Experimental Results

Visualization: 3B to 72B Model Scale

3B Results

Qwen2.5-VL-3B Grounding Performance

7B Results

Qwen2.5-VL-7B Grounding Performance

32B Results

Qwen2.5-VL-32B Grounding Performance

72B Results

Qwen2.5-VL-72B Grounding Performance

Scaling Performance on NOVA (Rare Diseases)

Model Size   Method       mAP@25         mAP@50         mAP@75
3B           Vanilla      0.221          0.088          0.040
3B           DDL (Ours)   0.298 (+35%)   0.150 (+70%)   0.066 (+66%)
7B           Vanilla      0.286          0.135          0.036
7B           DDL (Ours)   0.369 (+29%)   0.206 (+52%)   0.075 (+105%)
32B          Vanilla      0.406          0.208          0.058
32B          DDL (Ours)   0.454 (+12%)   0.266 (+28%)   0.096 (+66%)
72B          Vanilla      0.411          0.245          0.065
72B          DDL (Ours)   0.500 (+22%)   0.301 (+23%)   0.107 (+65%)

* DDL improves all three metrics at every model size. With DDL, the 32B model surpasses the 72B Vanilla (zero-shot) baseline on mAP@25, mAP@50, and mAP@75.
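The relative gains in parentheses can be recomputed from the table as (DDL - Vanilla) / Vanilla. The sketch below does this from the rounded scores shown above. The dictionary layout and the names `nova` and `relative_gain_pct` are illustrative, not part of the original release.

```python
# NOVA scores copied from the table: (mAP@25, mAP@50, mAP@75)
nova = {
    "3B": {"Vanilla": (0.221, 0.088, 0.040), "DDL": (0.298, 0.150, 0.066)},
    "7B": {"Vanilla": (0.286, 0.135, 0.036), "DDL": (0.369, 0.206, 0.075)},
    "32B": {"Vanilla": (0.406, 0.208, 0.058), "DDL": (0.454, 0.266, 0.096)},
    "72B": {"Vanilla": (0.411, 0.245, 0.065), "DDL": (0.500, 0.301, 0.107)},
}


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Rounded relative gain per model size and metric.
gains = {
    size: tuple(
        round(relative_gain_pct(v, d))
        for v, d in zip(rows["Vanilla"], rows["DDL"])
    )
    for size, rows in nova.items()
}

# Check the footnote: 32B with DDL versus 72B Vanilla on every metric.
ddl_32b_beats_72b_vanilla = all(
    d > v for d, v in zip(nova["32B"]["DDL"], nova["72B"]["Vanilla"])
)

for size, g in gains.items():
    print(size, g)
print("32B DDL > 72B Vanilla on all metrics:", ddl_32b_beats_72b_vanilla)
```

Values recomputed this way can differ from the reported percentages by a few points (for example, 7B at mAP@75), presumably because the reported gains were computed from unrounded scores.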

Citation

@inproceedings{li2026dynamic,
  title={Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases}, 
  author={Jun Li and Mingxuan Liu and Jiazhen Pan and Che Liu and Wenjia Bai and Cosmin I. Bercea and Julia A. Schnabel},
  booktitle={arXiv},
  year={2026},
}