Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning

dc.contributor.author: Yu, Zhijing
dc.contributor.author: Zhu, Zhongjie
dc.contributor.author: Ge, Di
dc.contributor.author: Tu, Renwei
dc.contributor.author: Bai, Yongqiang
dc.contributor.author: Yang, Yueping
dc.contributor.author: Wang, Yuer
dc.contributor.editor: Christie, Marc
dc.contributor.editor: Han, Ping-Hsuan
dc.contributor.editor: Lin, Shih-Syun
dc.contributor.editor: Pietroni, Nico
dc.contributor.editor: Schneider, Teseo
dc.contributor.editor: Tsai, Hsin-Ruey
dc.contributor.editor: Wang, Yu-Shuen
dc.contributor.editor: Zhang, Eugene
dc.date.accessioned: 2025-10-07T06:03:20Z
dc.date.available: 2025-10-07T06:03:20Z
dc.date.issued: 2025
dc.description.abstract: Hand skeleton-based gesture recognition is a crucial task in human-computer interaction and virtual reality. It aims to achieve precise classification by analyzing the spatio-temporal dynamics of skeleton joints. However, existing methods struggle to effectively model highly entangled spatio-temporal features and fuse heterogeneous Joint, Bone, and Motion (J/B/JM) modalities. These limitations hinder recognition performance. To address these challenges, we propose an Adaptive Spatio-Temporal Network (ASTD-Net) for gesture recognition. Our approach centers on integrated spatio-temporal feature learning and collaborative optimization. First, for spatial feature learning, we design an Adaptive Multi-Subgraph Convolution Module (AMS-GCN), which mitigates spatial coupling interference and enhances structural representation. Subsequently, for temporal feature learning, we introduce a Multi-Scale Dilated Temporal Fusion Module (MD-TFN) that captures multi-granularity temporal patterns, spanning local details to global evolution. This allows for comprehensive modeling of temporal dependencies. Finally, we propose a Self-Supervised Spatio-Temporal Channel Adaptation Module (SSTC-A). Using a temporal discrepancy loss, SSTC-A dynamically optimizes cross-modal dependencies and strengthens alignment between heterogeneous J/B/JM features, enhancing their fusion. On the SHREC'17 and DHG-14/28 datasets, ASTD-Net achieves recognition accuracies of 97.50% and 93.57%, respectively. This performance surpasses current state-of-the-art methods by up to 0.50% and 1.07%. These results verify the effectiveness and superiority of our proposed method.
dc.description.sectionheaders: Detecting & Estimating from images
dc.description.seriesinformation: Pacific Graphics Conference Papers, Posters, and Demos
dc.identifier.doi: 10.2312/pg.20251271
dc.identifier.isbn: 978-3-03868-295-0
dc.identifier.pages: 9 pages
dc.identifier.uri: https://doi.org/10.2312/pg.20251271
dc.identifier.uri: https://diglib.eg.org/handle/10.2312/pg20251271
dc.publisher: The Eurographics Association
dc.rights: Attribution 4.0 International License
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: CCS Concepts: Computing methodologies → Activity recognition and understanding; Neural networks
dc.subject: Computing methodologies → Activity recognition and understanding
dc.subject: Neural networks
dc.title: Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning
Files
Original bundle
Name: pg20251271.pdf
Size: 5.96 MB
Format: Adobe Portable Document Format
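Note: the abstract names a Joint/Bone/Motion (J/B/JM) modality split and a Multi-Scale Dilated Temporal Fusion Module (MD-TFN). The record does not include the authors' code, so the following is only a minimal PyTorch sketch of how these two ideas are commonly realized in skeleton-based recognition; the module structure, parent hierarchy, dilation rates, and tensor shapes below are assumptions, not the paper's implementation.

```python
# Hypothetical sketch -- not the authors' ASTD-Net code.
import torch
import torch.nn as nn

def joints_to_bones(joints, parents):
    """Bone modality: vector from each joint to its parent joint.
    joints: (N, C, T, V) batch of joint coordinates; parents: list of
    length V giving each joint's parent index (root points to itself)."""
    return joints - joints[:, :, :, parents]

def joints_to_motion(joints):
    """Motion modality: frame-to-frame displacement, zero-padded at t=0."""
    motion = torch.zeros_like(joints)
    motion[:, :, 1:] = joints[:, :, 1:] - joints[:, :, :-1]
    return motion

class MultiScaleDilatedTemporalConv(nn.Module):
    """Parallel temporal convolutions with increasing dilation, fused by
    1x1 convolution -- one plausible reading of the MD-TFN idea: small
    dilations capture local details, large ones capture global evolution."""
    def __init__(self, channels, dilations=(1, 2, 4), kernel_size=3):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) * d // 2  # preserve temporal length T
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (kernel_size, 1),
                          padding=(pad, 0), dilation=(d, 1)),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True)))
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):  # x: (N, C, T, V)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Toy usage: 2 sequences, 3D coordinates, 32 frames, 22 hand joints.
x = torch.randn(2, 3, 32, 22)
parents = [0] + list(range(21))  # placeholder chain hierarchy, not SHREC'17's
bones, motion = joints_to_bones(x, parents), joints_to_motion(x)
print(MultiScaleDilatedTemporalConv(3)(x).shape)  # torch.Size([2, 3, 32, 22])
```

Dilated branches widen the temporal receptive field without adding parameters per branch, which is why multi-scale dilation is a common way to cover multi-granularity temporal patterns in a single block.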