Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning

Date
2025
Publisher
The Eurographics Association
Abstract
Hand skeleton-based gesture recognition is a crucial task in human-computer interaction and virtual reality. It aims to achieve precise classification by analyzing the spatio-temporal dynamics of skeleton joints. However, existing methods struggle to effectively model highly entangled spatio-temporal features and to fuse the heterogeneous Joint, Bone, and Motion (J/B/JM) modalities. These limitations hinder recognition performance. To address these challenges, we propose an Adaptive Spatio-Temporal Network (ASTD-Net) for gesture recognition. Our approach centers on integrated spatio-temporal feature learning and collaborative optimization. First, for spatial feature learning, we design an Adaptive Multi-Subgraph Convolution Module (AMS-GCN), which mitigates spatial coupling interference and enhances structural representation. Second, for temporal feature learning, we introduce a Multi-Scale Dilated Temporal Fusion Module (MD-TFN) that captures multi-granularity temporal patterns, spanning local details to global evolution. This allows for comprehensive modeling of temporal dependencies. Finally, we propose a Self-Supervised Spatio-Temporal Channel Adaptation Module (SSTC-A). Using a temporal discrepancy loss, SSTC-A dynamically optimizes cross-modal dependencies and strengthens alignment between heterogeneous J/B/JM features, enhancing their fusion. On the SHREC'17 and DHG-14/28 datasets, ASTD-Net achieves recognition accuracies of 97.50% and 93.57%, respectively. This performance surpasses current state-of-the-art methods by up to 0.50 and 1.07 percentage points, respectively. These results verify the effectiveness and superiority of our proposed method.
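The abstract names three input modalities (J/B/JM) without saying how they are computed. As an illustration only, the sketch below follows a convention that is common in skeleton-based recognition: the bone stream is each joint minus its parent joint, and the motion stream is the frame-to-frame difference of joint positions. This is an assumption, not the authors' confirmed preprocessing; the function names, the three-joint toy skeleton, and the parent table are all hypothetical.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # one 3D joint position
Frame = List[Joint]                  # all joints at one time step


def bone_stream(frames: Sequence[Frame], parents: Sequence[int]) -> List[Frame]:
    """Bone vectors: each joint minus its parent joint.

    Assumption: the root joint lists itself as parent, so its bone is zero.
    """
    return [
        [
            tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
            for j in range(len(frame))
        ]
        for frame in frames
    ]


def motion_stream(frames: Sequence[Frame]) -> List[Frame]:
    """Joint motion: frame t+1 minus frame t, with the last frame zero-padded."""
    out: List[Frame] = []
    for t in range(len(frames)):
        if t + 1 < len(frames):
            out.append([
                tuple(b - a for a, b in zip(cur, nxt))
                for cur, nxt in zip(frames[t], frames[t + 1])
            ])
        else:
            out.append([(0.0, 0.0, 0.0) for _ in frames[t]])
    return out


# Hypothetical toy skeleton: 3 joints, 2 frames (not a real hand topology).
toy_parents = [0, 0, 1]
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 2.0, 1.0)],
]
toy_bones = bone_stream(toy_frames, toy_parents)
toy_motion = motion_stream(toy_frames)
```

In this convention the joint stream is simply the raw coordinates, so the three streams share the same shape and can be fed to parallel branches before fusion. How ASTD-Net actually aligns and fuses them is described only at the level of the abstract above.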
Description

CCS Concepts: Computing methodologies → Activity recognition and understanding; Neural networks

        
@inproceedings{10.2312:pg.20251271,
  booktitle = {Pacific Graphics Conference Papers, Posters, and Demos},
  editor = {Christie, Marc and Han, Ping-Hsuan and Lin, Shih-Syun and Pietroni, Nico and Schneider, Teseo and Tsai, Hsin-Ruey and Wang, Yu-Shuen and Zhang, Eugene},
  title = {{Skeletal Gesture Recognition Based on Joint Spatio-Temporal and Multi-Modal Learning}},
  author = {Yu, Zhijing and Zhu, Zhongjie and Ge, Di and Tu, Renwei and Bai, Yongqiang and Yang, Yueping and Wang, Yuer},
  year = {2025},
  publisher = {The Eurographics Association},
  ISBN = {978-3-03868-295-0},
  DOI = {10.2312/pg.20251271}
}