Title: Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment
Authors: Li, Boxuan; Liu, Juan; Masia, Belen; Thies, Justus
Date: 2026-04-17 (2026)
ISSN: 1467-8659
Handle: https://diglib.eg.org/handle/10.1111/cgf70368
DOI: https://doi.org/10.1111/cgf.70368
License: CC-BY-4.0
CCS Concepts: Computing methodologies → Computer vision; Machine learning; Scene understanding; Multi-task learning; Neural networks
Pages: 9

Abstract: Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints for objects of any category using only a few labeled samples, making it a challenging yet crucial task for general-purpose visual understanding. Existing methods rely on either visual or textual inputs, but the lack of cross-modal interaction limits generalization. Without a unified input representation, relying solely on visual features hinders consistent prediction of same-type keypoints, while fixed textual representations fail to capture their diverse characteristics, leading to coarse and over-generalized outputs. To address these limitations, we propose two multi-modal frameworks that perform visual-textual integration at both the feature and decision levels. Our feature-level module leverages cross-modal attention to align and enhance keypoint representations, while the decision-level fusion adaptively combines modality-specific predictions through a modality-consistency loss. Experiments on the large-scale MP-100 dataset demonstrate that our method surpasses existing baselines in both accuracy and robustness. Under the challenging 1-shot setting, our model achieves a 0.58% improvement in PCK@0.2 over the state-of-the-art CAPE method.
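The abstract describes the two fusion levels only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one plausible form of the feature-level step, where visual keypoint queries attend to text embeddings via cross-modal attention, together with a decision-level consistency term that pushes the visual and textual keypoint predictions to agree. The module names, tensor shapes, and the choice of a symmetric KL divergence for the consistency loss are all assumptions made for the sake of a self-contained example.

```python
# Hypothetical sketch of cross-modal attention fusion and a
# modality-consistency loss; illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalKeypointFusion(nn.Module):
    """Feature-level fusion: keypoint queries attend to text embeddings."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, kp_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # kp_feats:   (B, K, C) visual features for K support keypoints
        # text_feats: (B, T, C) embeddings of keypoint text descriptions
        attended, _ = self.attn(query=kp_feats, key=text_feats, value=text_feats)
        return self.norm(kp_feats + attended)  # residual connection + layer norm


def modality_consistency_loss(heat_vis: torch.Tensor, heat_txt: torch.Tensor) -> torch.Tensor:
    """Decision-level term: encourage the visual- and text-branch keypoint
    heatmaps to agree (here via a symmetric KL divergence over locations)."""
    B, K, H, W = heat_vis.shape
    p = F.log_softmax(heat_vis.view(B, K, -1), dim=-1)
    q = F.log_softmax(heat_txt.view(B, K, -1), dim=-1)
    return 0.5 * (
        F.kl_div(p, q, log_target=True, reduction="batchmean")
        + F.kl_div(q, p, log_target=True, reduction="batchmean")
    )


if __name__ == "__main__":
    fusion = CrossModalKeypointFusion(dim=256)
    kp = torch.randn(2, 17, 256)   # 17 keypoint queries (illustrative)
    txt = torch.randn(2, 17, 256)  # matching text embeddings
    fused = fusion(kp, txt)
    loss = modality_consistency_loss(torch.randn(2, 17, 64, 64),
                                     torch.randn(2, 17, 64, 64))
    print(fused.shape, loss.item())
```

In this sketch the consistency term operates on per-keypoint heatmaps from two prediction branches; the paper's actual adaptive decision-level fusion and loss formulation may differ.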