MultiCOIN: Multi-Modal COntrollable INbetweening

Tanveer, Maham; Zhou, Yang; Niklaus, Simon; Mahdavi Amiri, Ali; Zhang, Hao (Richard); Singh, Krishna Kumar; Zhao, Nanxuan

dc.contributor.author	Tanveer, Maham
dc.contributor.author	Zhou, Yang
dc.contributor.author	Niklaus, Simon
dc.contributor.author	Mahdavi Amiri, Ali
dc.contributor.author	Zhang, Hao (Richard)
dc.contributor.author	Singh, Krishna Kumar
dc.contributor.author	Zhao, Nanxuan
dc.contributor.editor	Masia, Belen
dc.contributor.editor	Thies, Justus
dc.date.accessioned	2026-04-17T12:13:06Z
dc.date.available	2026-04-17T12:13:06Z
dc.date.issued	2026
dc.description.abstract	Video inbetweening creates smooth transitions between two frames, making it an indispensable tool for video editing and long-form video synthesis. Existing methods struggle with large or complex motion and offer limited control over intermediate frames, so their results often misalign with user intent. We introduce MultiCOIN, a video inbetweening framework supporting multi-modal controls, including depth transitions and layering, motion trajectories, text prompts, and target regions for movement localization. It balances flexibility, usability, and fine-grained precision. Our model is built on a Diffusion Transformer (DiT), chosen for its proven capability to generate high-quality long videos, and maps all motion controls into a unified sparse point-based representation compatible with the denoising process. Further, because the controls operate at varying levels of granularity and influence, we separate content and motion into two branches, enabling dedicated generators for each. A stage-wise training strategy ensures stable learning of multi-modal controls. Extensive experiments show improved motion complexity, controllability, and narrative consistency. Project Page: MultiCOIN.
dc.description.number	2
dc.description.sectionheaders	Temporal Vision: Video Generation, Pose, and Narrative
dc.description.seriesinformation	Computer Graphics Forum
dc.description.volume	45
dc.identifier.doi	10.1111/cgf.70362
dc.identifier.issn	1467-8659
dc.identifier.pages	11 pages
dc.identifier.uri	https://diglib.eg.org/handle/10.1111/cgf70362
dc.identifier.uri	https://doi.org/10.1111/cgf.70362
dc.publisher	The Eurographics Association and John Wiley & Sons Ltd.
dc.rights	CC-BY-4.0
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Computer vision tasks
dc.title	MultiCOIN: Multi-Modal COntrollable INbetweening

Files

Original bundle

Name: cgf70362.pdf
Size: 2.6 MB
Format: Adobe Portable Document Format

Collections

45-Issue 2
EG 2026 - Full Papers - CGF 45-Issue 2