CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification cannot adapt to the diversity of free-form answers and overlooks their detailed semantic information. To tackle these challenges, we introduce the Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework, which learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability on open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct additional interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
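To give a concrete picture of the token-interleaving idea behind CIFR, the following minimal PyTorch sketch alternates visual and textual tokens along the sequence axis before a sequential scan. It is illustrative only: InterleavedSequenceMixer is a hypothetical name, and a plain 1-D convolution stands in for the selective state-space (Mamba) mixer used by the actual method; consult the repository above for the authors' implementation.

# Hypothetical sketch of cross-modal interleaved feature representation (CIFR).
# A 1-D convolution is a stand-in for the selective state-space (Mamba) mixer.
import torch
import torch.nn as nn

class InterleavedSequenceMixer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Placeholder sequence mixer; a Mamba/SSM block would go here.
        self.mixer = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, D) visual tokens; txt: (B, Nt, D) text tokens.
        n = min(vis.size(1), txt.size(1))
        # Interleave tokens from the two modalities along the sequence axis,
        # so the scan alternates between vision and language at every step.
        inter = torch.stack((vis[:, :n], txt[:, :n]), dim=2).flatten(1, 2)  # (B, 2n, D)
        x = self.norm(inter)
        x = self.mixer(x.transpose(1, 2)).transpose(1, 2)  # scan over interleaved tokens
        return x + inter  # residual connection

# Usage sketch:
# fused = InterleavedSequenceMixer(256)(torch.randn(2, 49, 256), torch.randn(2, 20, 256))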

CCS Concepts: Computing methodologies → Medical visual question answering; Multi-task learning; Cross-Mamba interaction

        
Citation

@inproceedings{10.2312:pg.20251303,
  booktitle = {Pacific Graphics Conference Papers, Posters, and Demos},
  editor    = {Christie, Marc and Han, Ping-Hsuan and Lin, Shih-Syun and Pietroni, Nico and Schneider, Teseo and Tsai, Hsin-Ruey and Wang, Yu-Shuen and Zhang, Eugene},
  title     = {{CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering}},
  author    = {Jin, Qiangguo and Zheng, Xianyao and Cui, Hui and Sun, Changming and Fang, Yuqi and Cong, Cong and Su, Ran and Wei, Leyi and Xuan, Ping and Wang, Junbo},
  year      = {2025},
  publisher = {The Eurographics Association},
  ISBN      = {978-3-03868-295-0},
  DOI       = {10.2312/pg.20251303}
}