Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li1,2, Che Liu3, Wenjia Bai3, Mingxuan Liu4,
Rossella Arcucci3, Cosmin I. Bercea*1,5, Julia A. Schnabel*1,2,5,6

*Shared senior authors.
1 Technical University of Munich 2 Munich Center for Machine Learning 3 Imperial College London
4 University of Trento 5 Helmholtz AI and Helmholtz Munich 6 King's College London


Overview

Overview of Knowledge to Sight (K2Sight). Top: Complex medical terminology is distilled into attribute-based visual instructions, bridging knowledge to sight for abnormality grounding. Bottom: Our K2Sight-Lite, a 0.23B-parameter model trained on only 1.5% of the data, outperforms a 7.0B-parameter SOTA medical VLM trained on 1 million samples.


Method

Overview of the K2Sight framework. Top: Knowledge Decomposition. Clinical definitions are retrieved and decomposed into four core visual attributes: shape, intensity, density, and location. A large language model generates attribute-specific prompts, and human evaluation selects the most faithful and discriminative ones. Bottom: Semantic-Guided Training. Each image is paired with the selected prompts and used to train a vision-language model for abnormality grounding.
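For illustration, here is a minimal sketch of the knowledge-decomposition step. It assumes a generic `llm` callable and a hypothetical prompt template; the actual templates, retrieval pipeline, and human-evaluation interface are those described in the paper.

```python
from typing import Callable, Dict

# Core visual attributes used in K2Sight's knowledge decomposition.
ATTRIBUTES = ["shape", "intensity", "density", "location"]

# Hypothetical template asking the LLM to distill one attribute
# from a retrieved clinical definition into a visual instruction.
TEMPLATE = (
    "Clinical definition of '{name}': {definition}\n"
    "Describe the typical {attribute} of this abnormality on a chest "
    "X-ray in one short, visually grounded phrase."
)

def decompose(name: str, definition: str,
              llm: Callable[[str], str]) -> Dict[str, str]:
    """Generate one attribute-specific prompt per visual attribute."""
    return {
        attr: llm(TEMPLATE.format(name=name, definition=definition,
                                  attribute=attr))
        for attr in ATTRIBUTES
    }

if __name__ == "__main__":
    # Stand-in for a real LLM call (for demonstration only).
    echo = lambda prompt: f"<LLM answer to: {prompt[:40]}...>"
    prompts = decompose(
        "pneumothorax",
        "Air in the pleural space causing partial or complete lung collapse.",
        echo,
    )
    for attr, text in prompts.items():
        print(f"{attr}: {text}")
```

In the full pipeline, the candidate prompts produced this way are filtered by human evaluation before each image is paired with its selected prompts for semantic-guided training.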


Results

(1) Comparative performance analysis against other state-of-the-art (SOTA) methods, together with ablation studies of our framework. (2) Qualitative results: we show predictions from both our K2Sight-Lite and base models alongside other SOTA methods.

BibTeX


@misc{li2025knowledgesightreasoningvisual,
  title={Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding},
  author={Jun Li and Che Liu and Wenjia Bai and Mingxuan Liu and Rossella Arcucci and Cosmin I. Bercea and Julia A. Schnabel},
  year={2025},
  eprint={2508.04572},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.04572},
}

@misc{li2025enhancingabnormalitygroundingvision,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Jun Li and Che Liu and Wenjia Bai and Rossella Arcucci and Cosmin I. Bercea and Julia A. Schnabel},
  year={2025},
  eprint={2503.03278},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.03278},
}