CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

dc.contributor.author: Jin, Qiangguo (en_US)
dc.contributor.author: Zheng, Xianyao (en_US)
dc.contributor.author: Cui, Hui (en_US)
dc.contributor.author: Sun, Changming (en_US)
dc.contributor.author: Fang, Yuqi (en_US)
dc.contributor.author: Cong, Cong (en_US)
dc.contributor.author: Su, Ran (en_US)
dc.contributor.author: Wei, Leyi (en_US)
dc.contributor.author: Xuan, Ping (en_US)
dc.contributor.author: Wang, Junbo (en_US)
dc.contributor.editor: Christie, Marc (en_US)
dc.contributor.editor: Han, Ping-Hsuan (en_US)
dc.contributor.editor: Lin, Shih-Syun (en_US)
dc.contributor.editor: Pietroni, Nico (en_US)
dc.contributor.editor: Schneider, Teseo (en_US)
dc.contributor.editor: Tsai, Hsin-Ruey (en_US)
dc.contributor.editor: Wang, Yu-Shuen (en_US)
dc.contributor.editor: Zhang, Eugene (en_US)
dc.date.accessioned: 2025-10-07T06:05:09Z
dc.date.available: 2025-10-07T06:05:09Z
dc.date.issued: 2025
dc.description.abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification cannot adapt to the diversity of free-form answers and overlooks their detailed semantics. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct interpretability experiments to further demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL. (en_US)
dc.description.sectionheaders: Visualization
dc.description.seriesinformation: Pacific Graphics Conference Papers, Posters, and Demos
dc.identifier.doi: 10.2312/pg.20251303
dc.identifier.isbn: 978-3-03868-295-0
dc.identifier.pages: 7 pages
dc.identifier.uri: https://doi.org/10.2312/pg.20251303
dc.identifier.uri: https://diglib.eg.org/handle/10.2312/pg20251303
dc.publisher: The Eurographics Association (en_US)
dc.rights: Attribution 4.0 International License
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: CCS Concepts: Computing methodologies → Medical visual question answering; Multi-task learning; Cross-Mamba interaction
dc.subject: Computing methodologies → Medical visual question answering
dc.subject: Multi-task learning
dc.subject: Cross-Mamba interaction
dc.title: CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering (en_US)
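
Note: the abstract describes a cross-modal interleaved feature representation (CIFR) that mixes visual and textual token sequences before a Mamba-based block. The snippet below is a minimal, hypothetical sketch of such token interleaving, not the authors' CMI-MTL implementation; the class name InterleavedFusion, all dimensions, and the use of nn.GRU as a stand-in for a Mamba (selective state-space) block are assumptions for illustration only.

```python
# Hypothetical sketch of cross-modal token interleaving followed by a sequence
# model, loosely inspired by the CIFR idea in the abstract. NOT the authors'
# code; nn.GRU merely stands in for a Mamba/state-space block.
import torch
import torch.nn as nn


class InterleavedFusion(nn.Module):
    """Interleaves visual and textual tokens, then models the mixed sequence."""

    def __init__(self, dim: int = 256, num_answers: int = 100):
        super().__init__()
        # Stand-in for a Mamba block; any sequence model could be plugged in here.
        self.seq_model = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.cls_head = nn.Linear(2 * dim, num_answers)  # closed-set answer head

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # vis_tokens: (B, Nv, D) image patch features
        # txt_tokens: (B, Nt, D) question token features
        n = min(vis_tokens.size(1), txt_tokens.size(1))
        # Alternate one visual token with one textual token: v1, t1, v2, t2, ...
        mixed = torch.stack((vis_tokens[:, :n], txt_tokens[:, :n]), dim=2)
        mixed = mixed.flatten(1, 2)                      # (B, 2n, D)
        # Append leftover tokens from the longer modality.
        leftover = vis_tokens[:, n:] if vis_tokens.size(1) > n else txt_tokens[:, n:]
        mixed = torch.cat((mixed, leftover), dim=1)
        out, _ = self.seq_model(mixed)                   # (B, L, 2D)
        pooled = out.mean(dim=1)                         # simple mean pooling
        return self.cls_head(pooled)                     # answer logits


if __name__ == "__main__":
    fusion = InterleavedFusion()
    vis = torch.randn(2, 49, 256)   # e.g. a 7x7 patch grid
    txt = torch.randn(2, 20, 256)   # e.g. 20 question tokens
    print(fusion(vis, txt).shape)   # torch.Size([2, 100])
```

For the paper's actual modules (FVTA, CIFR, FFAE) and the free-form answer multi-task objective, see the repository linked in the abstract.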
Files
Original bundle: pg20251303.pdf (757.4 KB, Adobe Portable Document Format)