MultiCOIN: Multi-Modal COntrollable INbetweening

dc.contributor.author: Tanveer, Maham
dc.contributor.author: Zhou, Yang
dc.contributor.author: Niklaus, Simon
dc.contributor.author: Mahdavi Amiri, Ali
dc.contributor.author: Zhang, Hao (Richard)
dc.contributor.author: Singh, Krishna Kumar
dc.contributor.author: Zhao, Nanxuan
dc.contributor.editor: Masia, Belen
dc.contributor.editor: Thies, Justus
dc.date.accessioned: 2026-04-17T12:13:06Z
dc.date.available: 2026-04-17T12:13:06Z
dc.date.issued: 2026
dc.description.abstract: Video inbetweening creates smooth transitions between two frames, making it an indispensable tool for video editing and long-form video synthesis. Existing methods struggle with large or complex motion and offer limited control over intermediate frames, so their results often diverge from user intent. We introduce MultiCOIN, a video inbetweening framework supporting multi-modal controls, including depth transitions and layering, motion trajectories, text prompts, and target regions for movement localization. It balances flexibility, usability, and fine-grained precision. Built on a Diffusion Transformer (DiT), chosen for its proven capability to generate high-quality long videos, our model maps all motion controls into a unified sparse point-based representation compatible with the denoising process. Further, to respect the variety of controls, which operate at varying levels of granularity and influence, we separate content and motion into two branches, enabling dedicated generators for each. A stage-wise training strategy ensures stable learning of multi-modal controls. Extensive experiments show improved motion complexity, controllability, and narrative consistency. Project Page: MultiCOIN.
dc.description.number: 2
dc.description.sectionheaders: Temporal Vision: Video Generation, Pose, and Narrative
dc.description.seriesinformation: Computer Graphics Forum
dc.description.volume: 45
dc.identifier.doi: 10.1111/cgf.70362
dc.identifier.issn: 1467-8659
dc.identifier.pages: 11 pages
dc.identifier.uri: https://diglib.eg.org/handle/10.1111/cgf70362
dc.identifier.uri: https://doi.org/10.1111/cgf.70362
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd.
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Computer vision tasks
dc.title: MultiCOIN: Multi-Modal COntrollable INbetweening
Files
Original bundle
Name: cgf70362.pdf
Size: 2.6 MB
Format: Adobe Portable Document Format