Title: Audio-Driven Speech Animation with Text-Guided Expression
Authors: Jung, Sunjin; Chun, Sewhan; Noh, Junyong
Editors: Chen, Renjie; Ritschel, Tobias; Whiting, Emily
Date: 2024-10-13
ISBN: 978-3-03868-250-9
DOI: https://doi.org/10.2312/pg.20241290
Handle: https://diglib.eg.org/handle/10.2312/pg20241290
Pages: 12
License: Attribution 4.0 International License
CCS Concepts: Computing methodologies → Animation; Neural networks

Abstract: We introduce a novel method for generating expressive speech animations of a 3D face, driven by both audio and text descriptions. Many previous approaches focused on generating facial expressions using pre-defined emotion categories. In contrast, our method can generate facial expressions from text descriptions unseen during training, without being limited to specific emotion classes. Our system employs a two-stage approach. In the first stage, an auto-encoder is trained to disentangle content and expression features from facial animations. In the second stage, two transformer-based networks predict the content and expression features from audio and text inputs, respectively. These features are then passed to the decoder of the pre-trained auto-encoder, yielding the final expressive speech animation. By accommodating diverse forms of natural language, such as emotion words or detailed facial expression descriptions, our method offers an intuitive and versatile way to generate expressive speech animations. Extensive quantitative and qualitative evaluations, including a user study, demonstrate that our method can produce natural expressive speech animations that correspond to the input audio and text descriptions.
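The abstract describes a two-stage architecture: a content/expression-disentangling auto-encoder, followed by two transformer-based predictors whose outputs are decoded by the frozen Stage-1 decoder. The following is a minimal sketch of that pipeline, assuming PyTorch; all module names (MotionAutoEncoder, AudioToContent, TextToExpression), feature dimensions, and the choice of wav2vec-/CLIP-style input features are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two-stage pipeline outlined in the abstract.
# Module names, feature sizes, and the use of vanilla TransformerEncoder
# layers are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

FEAT_DIM = 256          # assumed latent feature size
VERTEX_DIM = 5023 * 3   # assumed per-frame 3D face vertex dimension

class MotionAutoEncoder(nn.Module):
    """Stage 1: disentangle content and expression features from animation frames."""
    def __init__(self):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Linear(VERTEX_DIM, FEAT_DIM), nn.ReLU(),
                                         nn.Linear(FEAT_DIM, FEAT_DIM))
        self.expr_enc = nn.Sequential(nn.Linear(VERTEX_DIM, FEAT_DIM), nn.ReLU(),
                                      nn.Linear(FEAT_DIM, FEAT_DIM))
        self.decoder = nn.Sequential(nn.Linear(2 * FEAT_DIM, FEAT_DIM), nn.ReLU(),
                                     nn.Linear(FEAT_DIM, VERTEX_DIM))

    def forward(self, frames):                       # frames: (B, T, VERTEX_DIM)
        content, expr = self.content_enc(frames), self.expr_enc(frames)
        recon = self.decoder(torch.cat([content, expr], dim=-1))
        return recon, content, expr

class AudioToContent(nn.Module):
    """Stage 2a: predict per-frame content features from audio features."""
    def __init__(self, audio_dim=768):               # e.g. wav2vec-style features (assumption)
        super().__init__()
        self.proj = nn.Linear(audio_dim, FEAT_DIM)
        layer = nn.TransformerEncoderLayer(d_model=FEAT_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, audio_feats):                  # (B, T, audio_dim)
        return self.encoder(self.proj(audio_feats))  # (B, T, FEAT_DIM)

class TextToExpression(nn.Module):
    """Stage 2b: predict expression features from text-embedding tokens."""
    def __init__(self, text_dim=512):                # e.g. CLIP-style text embedding (assumption)
        super().__init__()
        self.proj = nn.Linear(text_dim, FEAT_DIM)
        layer = nn.TransformerEncoderLayer(d_model=FEAT_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens, num_frames):      # text_tokens: (B, L, text_dim)
        expr = self.encoder(self.proj(text_tokens)).mean(dim=1, keepdim=True)
        return expr.expand(-1, num_frames, -1)       # broadcast one expression over all frames

def synthesize(autoencoder, audio_net, text_net, audio_feats, text_tokens):
    """Combine predicted features and decode with the frozen Stage-1 decoder."""
    content = audio_net(audio_feats)
    expr = text_net(text_tokens, num_frames=content.shape[1])
    return autoencoder.decoder(torch.cat([content, expr], dim=-1))
```

In this sketch the Stage-1 decoder is reused unchanged at synthesis time, which mirrors the abstract's statement that the predicted features are passed to the decoder of the pre-trained auto-encoder; how the disentanglement is supervised and how text is encoded are left open here.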