CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
Date
2025
Publisher
The Eurographics Association
Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating Med-VQA as a simple classification problem may leave a model unable to adapt to the diversity of free-form answers and may cause it to overlook their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs. CIFR captures cross-modal sequential interactions. FFAE leverages auxiliary knowledge from open-ended questions, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct additional interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
Description
CCS Concepts: Computing methodologies → Medical visual question answering; Multi-task learning; Cross-Mamba interaction
@inproceedings{10.2312:pg.20251303,
booktitle = {Pacific Graphics Conference Papers, Posters, and Demos},
editor = {Christie, Marc and Han, Ping-Hsuan and Lin, Shih-Syun and Pietroni, Nico and Schneider, Teseo and Tsai, Hsin-Ruey and Wang, Yu-Shuen and Zhang, Eugene},
title = {{CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering}},
author = {Jin, Qiangguo and Zheng, Xianyao and Cui, Hui and Sun, Changming and Fang, Yuqi and Cong, Cong and Su, Ran and Wei, Leyi and Xuan, Ping and Wang, Junbo},
year = {2025},
publisher = {The Eurographics Association},
ISBN = {978-3-03868-295-0},
DOI = {10.2312/pg.20251303}
}
