Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning
| dc.contributor.author | Yu, Zhijing | en_US |
| dc.contributor.author | Zhu, Zhongjie | en_US |
| dc.contributor.author | Ge, Di | en_US |
| dc.contributor.author | Tu, Renwei | en_US |
| dc.contributor.author | Bai, Yongqiang | en_US |
| dc.contributor.author | Yang, Yueping | en_US |
| dc.contributor.author | Wang, Yuer | en_US |
| dc.contributor.editor | Christie, Marc | en_US |
| dc.contributor.editor | Han, Ping-Hsuan | en_US |
| dc.contributor.editor | Lin, Shih-Syun | en_US |
| dc.contributor.editor | Pietroni, Nico | en_US |
| dc.contributor.editor | Schneider, Teseo | en_US |
| dc.contributor.editor | Tsai, Hsin-Ruey | en_US |
| dc.contributor.editor | Wang, Yu-Shuen | en_US |
| dc.contributor.editor | Zhang, Eugene | en_US |
| dc.date.accessioned | 2025-10-07T06:03:20Z | |
| dc.date.available | 2025-10-07T06:03:20Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Hand skeleton-based gesture recognition is a crucial task in human-computer interaction and virtual reality. It aims to achieve precise classification by analyzing the spatio-temporal dynamics of skeleton joints. However, existing methods struggle to effectively model highly entangled spatio-temporal features and to fuse heterogeneous Joint, Bone, and Motion (J/B/JM) modalities. These limitations hinder recognition performance. To address these challenges, we propose an Adaptive Spatio-Temporal Network (ASTD-Net) for gesture recognition. Our approach centers on integrated spatio-temporal feature learning and collaborative optimization. First, for spatial feature learning, we design an Adaptive Multi-Subgraph Convolution Module (AMS-GCN), which mitigates spatial coupling interference and enhances structural representation. Subsequently, for temporal feature learning, we introduce a Multi-Scale Dilated Temporal Fusion Module (MD-TFN) that captures multi-granularity temporal patterns, spanning from local details to global evolution. This allows for comprehensive modeling of temporal dependencies. Finally, we propose a Self-Supervised Spatio-Temporal Channel Adaptation Module (SSTC-A). Using a temporal discrepancy loss, SSTC-A dynamically optimizes cross-modal dependencies and strengthens alignment between heterogeneous J/B/JM features, enhancing their fusion. On the SHREC'17 and DHG-14/28 datasets, ASTD-Net achieves recognition accuracies of 97.50% and 93.57%, respectively, surpassing current state-of-the-art methods by up to 0.50% and 1.07%. These results verify the effectiveness and superiority of our proposed method. | en_US |
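The abstract's Joint/Bone/Motion (J/B/JM) modalities are typically derived directly from the raw joint coordinates. As a minimal sketch of the common convention in skeleton-based recognition (the paper's exact preprocessing may differ): a bone vector is a joint's position minus its parent's position in the skeleton tree, and a motion vector is the frame-to-frame difference of a joint's position. The `parent` table below is an assumed input, not taken from the paper.

```python
# Hedged sketch: deriving the Bone (B) and Motion (JM) streams from the
# Joint (J) stream, following the usual convention in skeleton-based
# action/gesture recognition. Not the paper's verified preprocessing.
#
# joints[t][v] = (x, y, z) of joint v at frame t
# parent[v]    = index of joint v's parent in the skeleton tree (-1 for root)

def bone_stream(joints, parent):
    """Bone vectors: each joint minus its parent; the root keeps a zero vector."""
    return [
        [
            tuple(c - frame[parent[v]][i] for i, c in enumerate(j))
            if parent[v] >= 0 else (0.0, 0.0, 0.0)
            for v, j in enumerate(frame)
        ]
        for frame in joints
    ]

def motion_stream(joints):
    """Frame-to-frame joint differences; the last frame is padded with zeros."""
    out = []
    for t in range(len(joints)):
        nxt = joints[t + 1] if t + 1 < len(joints) else joints[t]
        out.append([tuple(b - a for a, b in zip(j, k))
                    for j, k in zip(joints[t], nxt)])
    return out
```

Each stream has the same shape as the joint stream, so the three modalities can be fed to identical network branches and fused downstream, which is the setting the abstract's SSTC-A module addresses.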
| dc.description.sectionheaders | Detecting & Estimating from images | |
| dc.description.seriesinformation | Pacific Graphics Conference Papers, Posters, and Demos | |
| dc.identifier.doi | 10.2312/pg.20251271 | |
| dc.identifier.isbn | 978-3-03868-295-0 | |
| dc.identifier.pages | 9 pages | |
| dc.identifier.uri | https://doi.org/10.2312/pg.20251271 | |
| dc.identifier.uri | https://diglib.eg.org/handle/10.2312/pg20251271 | |
| dc.publisher | The Eurographics Association | en_US |
| dc.rights | Attribution 4.0 International License | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | CCS Concepts: Computing methodologies → Activity recognition and understanding; Neural networks | |
| dc.subject | Computing methodologies → Activity recognition and understanding | |
| dc.subject | Neural networks | |
| dc.title | Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning | en_US |