CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
| dc.contributor.author | Jin, Qiangguo | en_US |
| dc.contributor.author | Zheng, Xianyao | en_US |
| dc.contributor.author | Cui, Hui | en_US |
| dc.contributor.author | Sun, Changming | en_US |
| dc.contributor.author | Fang, Yuqi | en_US |
| dc.contributor.author | Cong, Cong | en_US |
| dc.contributor.author | Su, Ran | en_US |
| dc.contributor.author | Wei, Leyi | en_US |
| dc.contributor.author | Xuan, Ping | en_US |
| dc.contributor.author | Wang, Junbo | en_US |
| dc.contributor.editor | Christie, Marc | en_US |
| dc.contributor.editor | Han, Ping-Hsuan | en_US |
| dc.contributor.editor | Lin, Shih-Syun | en_US |
| dc.contributor.editor | Pietroni, Nico | en_US |
| dc.contributor.editor | Schneider, Teseo | en_US |
| dc.contributor.editor | Tsai, Hsin-Ruey | en_US |
| dc.contributor.editor | Wang, Yu-Shuen | en_US |
| dc.contributor.editor | Zhang, Eugene | en_US |
| dc.date.accessioned | 2025-10-07T06:05:09Z | |
| dc.date.available | 2025-10-07T06:05:09Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification cannot adapt to the diversity of free-form answers and overlooks their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct additional interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL. | en_US |
| dc.description.sectionheaders | Visualization | |
| dc.description.seriesinformation | Pacific Graphics Conference Papers, Posters, and Demos | |
| dc.identifier.doi | 10.2312/pg.20251303 | |
| dc.identifier.isbn | 978-3-03868-295-0 | |
| dc.identifier.pages | 7 pages | |
| dc.identifier.uri | https://doi.org/10.2312/pg.20251303 | |
| dc.identifier.uri | https://diglib.eg.org/handle/10.2312/pg20251303 | |
| dc.publisher | The Eurographics Association | en_US |
| dc.rights | Attribution 4.0 International License | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | CCS Concepts: Computing methodologies → Medical visual question answering; Multi-task learning; Cross-Mamba interaction | |
| dc.subject | Computing methodologies → Medical visual question answering | |
| dc.subject | Multi-task learning |
| dc.subject | Cross-Mamba interaction |
| dc.title | CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering | en_US |
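As an illustration of the cross-modal interleaved feature representation (CIFR) described in the abstract, below is a minimal, hypothetical PyTorch sketch: image and text token sequences are interleaved into a single sequence, scanned by a Mamba block, and split back per modality. It assumes the `mamba_ssm` package and equal-length token sequences; the module name `InterleavedCrossMamba` and the exact interleaving scheme are illustrative assumptions based only on the abstract, not the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of cross-modal interleaved feature representation:
# interleave image/text tokens, run one selective state-space (Mamba) scan,
# then de-interleave. Not the authors' code; an assumption from the abstract.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class InterleavedCrossMamba(nn.Module):  # hypothetical module name
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)  # selective SSM block, (B, L, D) in/out

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens, txt_tokens: (B, L, D), equal length L for simplicity.
        B, L, D = img_tokens.shape
        # Interleave along the sequence axis: v1, t1, v2, t2, ...
        mixed = torch.stack((img_tokens, txt_tokens), dim=2).reshape(B, 2 * L, D)
        # Residual Mamba scan over the interleaved cross-modal sequence.
        mixed = self.mamba(self.norm(mixed)) + mixed
        # De-interleave back into per-modality sequences.
        mixed = mixed.reshape(B, L, 2, D)
        return mixed[:, :, 0], mixed[:, :, 1]
```

The interleaving is the point of the sketch: because the state-space scan is sequential, alternating modalities lets the hidden state carry visual context into each text token (and vice versa) at every step, which is one plausible reading of "cross-modal sequential interactions".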