Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment

dc.contributor.author: Li, Boxuan
dc.contributor.author: Liu, Juan
dc.contributor.editor: Masia, Belen
dc.contributor.editor: Thies, Justus
dc.date.accessioned: 2026-04-17T12:22:10Z
dc.date.available: 2026-04-17T12:22:10Z
dc.date.issued: 2026
dc.description.abstract: Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints for objects of any category using only a few labeled samples, making it a challenging yet crucial task for general-purpose visual understanding. Existing methods rely on either visual or textual inputs, but the lack of cross-modal interaction limits generalization. Without a unified input representation, solely using visual features hinders consistent prediction of same-type keypoints, while fixed textual representations fail to capture the diverse characteristics of same-type keypoints, leading to coarse and over-generalized outputs. To address these limitations, we propose two multi-modal frameworks that perform visual-textual integration at both the feature and decision levels. Our feature-level module leverages cross-modal attention to align and enhance keypoint representations, while the decision-level fusion adaptively combines modality-specific predictions through a modality-consistency loss. Experiments on the large-scale MP-100 dataset demonstrate that our method surpasses existing baselines in both accuracy and robustness. Under the challenging 1-shot setting, our model achieves a 0.58% improvement in PCK@0.2 over the state-of-the-art CAPE method.
dc.description.number: 2
dc.description.sectionheaders: Temporal Vision: Video Generation, Pose, and Narrative
dc.description.seriesinformation: Computer Graphics Forum
dc.description.volume: 45
dc.identifier.doi: 10.1111/cgf.70368
dc.identifier.issn: 1467-8659
dc.identifier.pages: 9 pages
dc.identifier.uri: https://diglib.eg.org/handle/10.1111/cgf70368
dc.identifier.uri: https://doi.org/10.1111/cgf.70368
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd.
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Computing methodologies → Computer vision
dc.subject: Machine learning
dc.subject: Scene understanding
dc.subject: Multi-task learning
dc.subject: Neural networks
dc.subject: CCS Concepts
dc.title: Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment
Files
Original bundle
Name: cgf70368.pdf
Size: 3.31 MB
Format: Adobe Portable Document Format