Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li1,2, Che Liu3, Wenjia Bai3, Mingxuan Liu4,
Rossella Arcucci3, Cosmin I. Bercea*1,5, Julia A. Schnabel*1,2,5,6

*Shared senior authors.
1 Technical University of Munich 2 Munich Center for Machine Learning 3 Imperial College London
4 University of Trento 5 Helmholtz AI and Helmholtz Munich 6 King's College London


Overview

Overview of Knowledge to Sight (K2Sight). Top: Complex medical terminology is distilled into attribute-based visual instructions, bridging knowledge to sight for abnormality grounding. Bottom: Our K2Sight-Lite, a 0.23B-parameter model trained on only 1.5% of the data, outperforms a 7.0B-parameter SOTA medical VLM trained on 1 million samples.


Method

Overview of the K2Sight framework. Top: Knowledge Decomposition. Clinical definitions are retrieved and decomposed into four core visual attributes: shape, intensity, density, and location. A large language model generates attribute-specific prompts, and human evaluation selects the most faithful and discriminative ones. Bottom: Semantic-Guided Training. Each image is paired with the selected prompts and used to train a vision-language model for abnormality grounding.
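For illustration, here is a minimal sketch of the knowledge-decomposition step. It assumes a generic `llm` callable and a hypothetical prompt template; the actual templates, retrieval pipeline, and human-evaluation interface are those described in the paper.

```python
from typing import Callable, Dict

# Core visual attributes used in K2Sight's knowledge decomposition.
ATTRIBUTES = ["shape", "intensity", "density", "location"]

# Hypothetical template asking the LLM to distill one attribute
# from a retrieved clinical definition into a visual instruction.
TEMPLATE = (
    "Clinical definition of '{name}': {definition}\n"
    "Describe the typical {attribute} of this abnormality on a chest "
    "X-ray in one short, visually grounded phrase."
)

def decompose(name: str, definition: str,
              llm: Callable[[str], str]) -> Dict[str, str]:
    """Generate one attribute-specific prompt per visual attribute."""
    return {
        attr: llm(TEMPLATE.format(name=name, definition=definition,
                                  attribute=attr))
        for attr in ATTRIBUTES
    }

if __name__ == "__main__":
    # Stand-in for a real LLM call (for demonstration only).
    echo = lambda prompt: f"<LLM answer to: {prompt[:40]}...>"
    prompts = decompose(
        "pneumothorax",
        "Air in the pleural space causing partial or complete lung collapse.",
        echo,
    )
    for attr, text in prompts.items():
        print(f"{attr}: {text}")
```

In the full pipeline, the candidate prompts produced this way are filtered by human evaluation before each image is paired with its selected prompts for semantic-guided training.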


Results

(1) Comparative performance analysis against other state-of-the-art (SOTA) methods, together with ablation studies of our framework. (2) Qualitative results: we show predictions from both our K2Sight-Lite and base models alongside other SOTA methods.

BibTeX


@misc{li2025knowledgesightreasoningvisual,
  title={Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding},
  author={Jun Li and Che Liu and Wenjia Bai and Mingxuan Liu and Rossella Arcucci and Cosmin I. Bercea and Julia A. Schnabel},
  year={2025},
  eprint={2508.04572},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.04572},
}

@misc{li2025enhancingabnormalitygroundingvision,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Jun Li and Che Liu and Wenjia Bai and Rossella Arcucci and Cosmin I. Bercea and Julia A. Schnabel},
  year={2025},
  eprint={2503.03278},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.03278},
}