SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

dc.contributor.authorLu, Dongyue
dc.contributor.authorLiang, Ao
dc.contributor.authorHuang, Tianxin
dc.contributor.authorFu, Xiao
dc.contributor.authorZhao, Yuyang
dc.contributor.authorMa, Baorui
dc.contributor.authorPan, Liang
dc.contributor.authorYin, Wei
dc.contributor.authorKong, Lingdong
dc.contributor.authorOoi, Wei Tsang
dc.contributor.authorLiu, Ziwei
dc.contributor.editorMasia, Belen
dc.contributor.editorThies, Justus
dc.date.accessioned2026-04-17T11:54:16Z
dc.date.available2026-04-17T11:54:16Z
dc.date.issued2026
dc.description.abstractImmersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse-reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
dc.description.number2
dc.description.sectionheadersTemporal Vision: Video Generation, Pose, and Narrative
dc.description.seriesinformationComputer Graphics Forum
dc.description.volume45
dc.identifier.doi10.1111/cgf.70345
dc.identifier.issn1467-8659
dc.identifier.pages13 pages
dc.identifier.urihttps://diglib.eg.org/handle/10.1111/cgf70345
dc.identifier.urihttps://doi.org/10.1111/cgf.70345
dc.publisherThe Eurographics Association and John Wiley & Sons Ltd.
dc.rightsCC-BY-4.0
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectCCS Concepts: Computing methodologies → Reconstruction; Appearance and texture representations;
dc.titleSEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Files
Original bundle (2 files)
Name: cgf70345.pdf
Size: 17.05 MB
Format: Adobe Portable Document Format
Name: paper1201_mm1.pdf
Size: 33.51 MB
Format: Adobe Portable Document Format