Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment
Date
2026
Authors
Li, Boxuan; Liu, Juan
Journal Title
Computer Graphics Forum
Journal ISSN
1467-8659
Publisher
The Eurographics Association and John Wiley & Sons Ltd.
Abstract
Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints for objects of any category using only a few labeled samples, making it a challenging yet crucial task for general-purpose visual understanding. Existing methods rely on either visual or textual inputs, and the lack of cross-modal interaction limits generalization: without a unified input representation, visual features alone hinder consistent prediction of same-type keypoints, while fixed textual representations fail to capture their diverse characteristics, leading to coarse, over-generalized outputs. To address these limitations, we propose two multi-modal frameworks that integrate visual and textual information at both the feature and decision levels. The feature-level module leverages cross-modal attention to align and enhance keypoint representations, while the decision-level fusion adaptively combines modality-specific predictions through a modality-consistency loss. Experiments on the large-scale MP-100 dataset demonstrate that our method surpasses existing baselines in both accuracy and robustness. Under the challenging 1-shot setting, our model achieves a 0.58% improvement in PCK0.2 over the state-of-the-art CAPE method.
Description
@article{10.1111:cgf.70368,
journal = {Computer Graphics Forum},
title = {{Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment}},
author = {Li, Boxuan and Liu, Juan},
year = {2026},
publisher = {The Eurographics Association and John Wiley & Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.70368}
}
