CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
| dc.contributor.author | Jin, Qiangguo | en_US |
| dc.contributor.author | Zheng, Xianyao | en_US |
| dc.contributor.author | Cui, Hui | en_US |
| dc.contributor.author | Sun, Changming | en_US |
| dc.contributor.author | Fang, Yuqi | en_US |
| dc.contributor.author | Cong, Cong | en_US |
| dc.contributor.author | Su, Ran | en_US |
| dc.contributor.author | Wei, Leyi | en_US |
| dc.contributor.author | Xuan, Ping | en_US |
| dc.contributor.author | Wang, Junbo | en_US |
| dc.contributor.editor | Christie, Marc | en_US |
| dc.contributor.editor | Han, Ping-Hsuan | en_US |
| dc.contributor.editor | Lin, Shih-Syun | en_US |
| dc.contributor.editor | Pietroni, Nico | en_US |
| dc.contributor.editor | Schneider, Teseo | en_US |
| dc.contributor.editor | Tsai, Hsin-Ruey | en_US |
| dc.contributor.editor | Wang, Yu-Shuen | en_US |
| dc.contributor.editor | Zhang, Eugene | en_US |
| dc.date.accessioned | 2025-10-07T06:05:09Z | |
| dc.date.available | 2025-10-07T06:05:09Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification cannot adapt to the diversity of free-form answers and overlooks their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct additional interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL. | en_US |
| dc.description.sectionheaders | Visualization | |
| dc.description.seriesinformation | Pacific Graphics Conference Papers, Posters, and Demos | |
| dc.identifier.doi | 10.2312/pg.20251303 | |
| dc.identifier.isbn | 978-3-03868-295-0 | |
| dc.identifier.pages | 7 pages | |
| dc.identifier.uri | https://doi.org/10.2312/pg.20251303 | |
| dc.identifier.uri | https://diglib.eg.org/handle/10.2312/pg20251303 | |
| dc.publisher | The Eurographics Association | en_US |
| dc.rights | Attribution 4.0 International License | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | CCS Concepts: Computing methodologies → Medical visual question answering; Multi-task learning; Cross-Mamba interaction | |
| dc.subject | Computing methodologies → Medical visual question answering | |
| dc.subject | Multi-task learning |
| dc.subject | Cross-Mamba interaction |
| dc.title | CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering | en_US |
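As an illustration of the cross-modal interleaved feature representation (CIFR) described in the abstract, below is a minimal, hypothetical PyTorch sketch: image and text token sequences are interleaved into a single sequence, scanned by a Mamba block, and split back per modality. It assumes the `mamba_ssm` package and equal-length token sequences; the module name `InterleavedCrossMamba` and the exact interleaving scheme are illustrative assumptions based only on the abstract, not the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of cross-modal interleaved feature representation:
# interleave image/text tokens, run one selective state-space (Mamba) scan,
# then de-interleave. Not the authors' code; an assumption from the abstract.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class InterleavedCrossMamba(nn.Module):  # hypothetical module name
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)  # selective SSM block, (B, L, D) in/out

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens, txt_tokens: (B, L, D), equal length L for simplicity.
        B, L, D = img_tokens.shape
        # Interleave along the sequence axis: v1, t1, v2, t2, ...
        mixed = torch.stack((img_tokens, txt_tokens), dim=2).reshape(B, 2 * L, D)
        # Residual Mamba scan over the interleaved cross-modal sequence.
        mixed = self.mamba(self.norm(mixed)) + mixed
        # De-interleave back into per-modality sequences.
        mixed = mixed.reshape(B, L, 2, D)
        return mixed[:, :, 0], mixed[:, :, 1]
```

The interleaving is the point of the sketch: because the state-space scan is sequential, alternating modalities lets the hidden state carry visual context into each text token (and vice versa) at every step, which is one plausible reading of "cross-modal sequential interactions".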