A Multimodal Dataset for Dialogue Intent Recognition through Human Movement and Nonverbal Cues

Authors: Lin, Shu-Wei; Zhang, Jia-Xiang; Lu, Jun-Fu Lin; Huang, Yi-Jheng; Zhang, Junpo
Editors: Christie, Marc; Han, Ping-Hsuan; Lin, Shih-Syun; Pietroni, Nico; Schneider, Teseo; Tsai, Hsin-Ruey; Wang, Yu-Shuen; Zhang, Eugene
Date: 2025-10-07
ISBN: 978-3-03868-295-0
DOI: https://doi.org/10.2312/pg.20251310
URI: https://diglib.eg.org/handle/10.2312/pg20251310
License: Attribution 4.0 International License
Pages: 2

Abstract: This paper presents a multimodal dataset designed to advance dialogue intent recognition through skeleton-based representations and temporal human movement features. Rather than proposing a new model, our objective is to provide a high-quality, annotated dataset that captures the subtle nonverbal cues preceding human speech and interaction. The dataset includes skeletal joint coordinates, facial orientation, and contextual object data (e.g., microphone positions), collected from diverse participants across varied conversational scenarios. In future research, we will benchmark three types of learning methods and offer comparative insights: handcrafted feature models, sequence models (LSTM), and graph-based models (GCN). This resource aims to facilitate the development of more natural, sensor-free, and data-driven human-computer interaction systems by providing a robust foundation for training and evaluation.
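
As a rough illustration only, the sketch below shows one way the planned LSTM sequence baseline could consume per-frame features assembled from the dataset's modalities (skeletal joint coordinates, facial orientation, microphone position). The feature layout, dimensions, two-class setup, and the IntentLSTM name are assumptions made for illustration; they are not specified by the paper or dataset.

    # Hypothetical sketch (not from the paper): an LSTM baseline over
    # per-frame movement features for dialogue-intent classification.
    import torch
    import torch.nn as nn

    # Assumed per-frame layout: 17 joints x 3D coordinates, plus a 3D
    # facial-orientation vector and a 3D microphone position = 57 features.
    NUM_JOINTS = 17
    FRAME_FEATS = NUM_JOINTS * 3 + 3 + 3

    class IntentLSTM(nn.Module):
        """Sequence baseline: LSTM over per-frame features, intent logits out."""
        def __init__(self, in_dim=FRAME_FEATS, hidden=128, num_classes=2):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, x):             # x: (batch, frames, FRAME_FEATS)
            _, (h_n, _) = self.lstm(x)    # h_n: (1, batch, hidden)
            return self.head(h_n[-1])     # logits: (batch, num_classes)

    # Toy usage with random tensors standing in for 60-frame movement clips.
    clips = torch.randn(4, 60, FRAME_FEATS)   # batch of 4 clips
    logits = IntentLSTM()(clips)              # (4, 2) intent logits

A graph-based (GCN) baseline would instead treat the joints as nodes of a skeleton graph per frame; the handcrafted-feature baseline would replace the learned encoder with statistics computed over the same per-frame features.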