Audio-Driven Speech Animation with Text-Guided Expression

Date
2024
Publisher
The Eurographics Association
Abstract
We introduce a novel method for generating expressive speech animations of a 3D face, driven by both audio and text descriptions. Many previous approaches have focused on generating facial expressions from pre-defined emotion categories. In contrast, our method can generate facial expressions from text descriptions unseen during training, without being restricted to specific emotion classes. Our system employs a two-stage approach. In the first stage, an auto-encoder is trained to disentangle content and expression features from facial animations. In the second stage, two transformer-based networks predict the content and expression features from audio and text inputs, respectively. These features are then passed to the decoder of the pre-trained auto-encoder, yielding the final expressive speech animation. By accommodating diverse forms of natural language, such as emotion words or detailed descriptions of facial expressions, our method offers an intuitive and versatile way to generate expressive speech animations. Extensive quantitative and qualitative evaluations, including a user study, demonstrate that our method produces natural expressive speech animations that correspond to the input audio and text descriptions.
CCS Concepts: Computing methodologies → Animation; Neural networks
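The two-stage pipeline described in the abstract can be sketched in code. The following is a minimal, illustrative PyTorch sketch only: all module names, feature dimensions, and the choice of wav2vec 2.0-style audio features and a CLIP-style text embedding are assumptions for exposition and are not taken from the paper or its implementation.

# Hypothetical sketch of the two-stage pipeline described in the abstract.
# All module names and dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn


class MotionAutoencoder(nn.Module):
    """Stage 1: disentangle content and expression features from facial animation."""
    def __init__(self, vertex_dim=15069, content_dim=128, expr_dim=64):
        super().__init__()
        # vertex_dim = 5023 vertices x 3 coordinates is just an example choice.
        self.content_enc = nn.Sequential(nn.Linear(vertex_dim, 512), nn.ReLU(),
                                         nn.Linear(512, content_dim))
        self.expr_enc = nn.Sequential(nn.Linear(vertex_dim, 512), nn.ReLU(),
                                      nn.Linear(512, expr_dim))
        self.decoder = nn.Sequential(nn.Linear(content_dim + expr_dim, 512), nn.ReLU(),
                                     nn.Linear(512, vertex_dim))

    def forward(self, motion):                       # motion: (B, T, vertex_dim)
        content = self.content_enc(motion)           # speech-related component
        expr = self.expr_enc(motion)                 # expression-related component
        recon = self.decoder(torch.cat([content, expr], dim=-1))
        return recon, content, expr


class ContentPredictor(nn.Module):
    """Stage 2a: predict per-frame content features from audio features."""
    def __init__(self, audio_dim=768, content_dim=128):
        super().__init__()
        self.proj = nn.Linear(audio_dim, content_dim)
        layer = nn.TransformerEncoderLayer(d_model=content_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, audio_feats):                  # (B, T, audio_dim)
        return self.encoder(self.proj(audio_feats))  # (B, T, content_dim)


class ExpressionPredictor(nn.Module):
    """Stage 2b: predict expression features from a text embedding."""
    def __init__(self, text_dim=512, expr_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(),
                                 nn.Linear(256, expr_dim))

    def forward(self, text_emb, num_frames):         # text_emb: (B, text_dim)
        expr = self.mlp(text_emb)                    # (B, expr_dim)
        return expr.unsqueeze(1).expand(-1, num_frames, -1)


# Inference: combine predicted features and decode with the frozen stage-1 decoder.
if __name__ == "__main__":
    ae = MotionAutoencoder()
    audio_net, text_net = ContentPredictor(), ExpressionPredictor()
    audio_feats = torch.randn(1, 100, 768)           # e.g. wav2vec 2.0-style features
    text_emb = torch.randn(1, 512)                   # e.g. CLIP-style text embedding
    content = audio_net(audio_feats)
    expr = text_net(text_emb, num_frames=100)
    animation = ae.decoder(torch.cat([content, expr], dim=-1))
    print(animation.shape)                           # torch.Size([1, 100, 15069])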

        
@inproceedings{10.2312:pg.20241290,
  booktitle = {Pacific Graphics Conference Papers and Posters},
  editor    = {Chen, Renjie and Ritschel, Tobias and Whiting, Emily},
  title     = {{Audio-Driven Speech Animation with Text-Guided Expression}},
  author    = {Jung, Sunjin and Chun, Sewhan and Noh, Junyong},
  year      = {2024},
  publisher = {The Eurographics Association},
  ISBN      = {978-3-03868-250-9},
  DOI       = {10.2312/pg.20241290}
}