Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment
| dc.contributor.author | Li, Boxuan | |
| dc.contributor.author | Liu, Juan | |
| dc.contributor.editor | Masia, Belen | |
| dc.contributor.editor | Thies, Justus | |
| dc.date.accessioned | 2026-04-17T12:22:10Z | |
| dc.date.available | 2026-04-17T12:22:10Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints for objects of any category using only a few labeled samples, making it a challenging yet crucial task for general-purpose visual understanding. Existing methods rely on either visual or textual inputs, but the lack of cross-modal interaction limits generalization. Without a unified input representation, solely using visual features hinders consistent prediction of same-type keypoints, while fixed textual representations fail to capture the diverse characteristics of same-type keypoints, leading to coarse and over-generalized outputs. To address these limitations, we propose two multi-modal frameworks that perform visual-textual integration at both the feature and decision levels. Our feature-level module leverages cross-modal attention to align and enhance keypoint representations, while the decision-level fusion adaptively combines modality-specific predictions through a modality-consistency loss. Experiments on the large-scale MP-100 dataset demonstrate that our method surpasses existing baselines in both accuracy and robustness. Under the challenging 1-shot setting, our model achieves a 0.58% improvement in PCK@0.2 over the state-of-the-art CAPE method. | |
| dc.description.number | 2 | |
| dc.description.sectionheaders | Temporal Vision: Video Generation, Pose, and Narrative | |
| dc.description.seriesinformation | Computer Graphics Forum | |
| dc.description.volume | 45 | |
| dc.identifier.doi | 10.1111/cgf.70368 | |
| dc.identifier.issn | 1467-8659 | |
| dc.identifier.pages | 9 pages | |
| dc.identifier.uri | https://diglib.eg.org/handle/10.1111/cgf70368 | |
| dc.identifier.uri | https://doi.org/10.1111/cgf.70368 | |
| dc.publisher | The Eurographics Association and John Wiley & Sons Ltd. | |
| dc.rights | CC-BY-4.0 | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Computing methodologies → Computer vision | |
| dc.subject | Machine learning | |
| dc.subject | Scene understanding | |
| dc.subject | Multi-task learning | |
| dc.subject | Neural networks | |
| dc.subject | CCS Concepts | |
| dc.title | Enhancing Robust Category-Agnostic Pose Estimation through Multi-Modal Feature Alignment |