Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment

Li, Boxuan; Liu, Juan

dc.contributor.author	Li, Boxuan
dc.contributor.author	Liu, Juan
dc.contributor.editor	Masia, Belen
dc.contributor.editor	Thies, Justus
dc.date.accessioned	2026-04-17T12:22:10Z
dc.date.available	2026-04-17T12:22:10Z
dc.date.issued	2026
dc.description.abstract	Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints for objects of any category using only a few labeled samples, making it a challenging yet crucial task for general-purpose visual understanding. Existing methods rely on either visual or textual inputs, and the lack of cross-modal interaction limits their generalization. Without a unified input representation, relying solely on visual features hinders consistent prediction of same-type keypoints, while fixed textual representations fail to capture the diverse characteristics of same-type keypoints, leading to coarse and over-generalized outputs. To address these limitations, we propose two multi-modal frameworks that perform visual-textual integration at the feature level and the decision level, respectively. Our feature-level module leverages cross-modal attention to align and enhance keypoint representations, while the decision-level fusion adaptively combines modality-specific predictions through a modality-consistency loss. Experiments on the large-scale MP-100 dataset demonstrate that our method surpasses existing baselines in both accuracy and robustness. Under the challenging 1-shot setting, our model achieves a 0.58% improvement in PCK@0.2 over the state-of-the-art CAPE method.
dc.description.number	2
dc.description.sectionheaders	Temporal Vision: Video Generation, Pose, and Narrative
dc.description.seriesinformation	Computer Graphics Forum
dc.description.volume	45
dc.identifier.doi	10.1111/cgf.70368
dc.identifier.issn	1467-8659
dc.identifier.pages	9 pages
dc.identifier.uri	https://diglib.eg.org/handle/10.1111/cgf70368
dc.identifier.uri	https://doi.org/10.1111/cgf.70368
dc.publisher	The Eurographics Association and John Wiley & Sons Ltd.
dc.rights	CC-BY-4.0
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Computing methodologies → Computer vision
dc.subject	Machine learning
dc.subject	Scene understanding
dc.subject	Multi-task learning
dc.subject	Neural networks
dc.subject	CCS Concepts
dc.title	Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment
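
The abstract reports its result as PCK at threshold 0.2 (Percentage of Correct Keypoints). As a minimal sketch of how such a metric is commonly computed, and not the authors' evaluation code, the Python function below counts a predicted keypoint as correct when it lies within 0.2 times a reference size of its ground-truth location. The function name `pck`, its parameters, and the use of the longest bounding-box side as the reference size are assumptions made for illustration; the paper may normalize differently.

```python
import math


def pck(pred, gt, visible, bbox_size, alpha=0.2):
    """Fraction of visible keypoints predicted within alpha * bbox_size.

    pred, gt  : lists of (x, y) keypoint coordinates in pixels
    visible   : list of booleans, True where the keypoint is annotated
    bbox_size : reference length (assumed here: longest bounding-box side)
    alpha     : threshold ratio, 0.2 for PCK at 0.2
    """
    correct = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # unannotated keypoints do not count either way
        total += 1
        if math.hypot(px - gx, py - gy) <= alpha * bbox_size:
            correct += 1
    # With no visible keypoints the metric is undefined.
    return correct / total if total else float("nan")


# Hypothetical example: a 100-pixel box gives a 20-pixel tolerance.
pred = [(10, 10), (50, 52), (90, 40)]
gt = [(12, 11), (50, 50), (60, 40)]
print(round(pck(pred, gt, [True, True, True], bbox_size=100), 3))  # 0.667
```

In this example the third keypoint is 30 pixels off and exceeds the 20-pixel tolerance, so two of the three keypoints count as correct.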

Files

Original bundle

Name: cgf70368.pdf
Size: 3.31 MB
Format: Adobe Portable Document Format

Collections

45-Issue 2
EG 2026 - Full Papers - CGF 45-Issue 2