SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Lu, Dongyue; Liang, Ao; Huang, Tianxin; Fu, Xiao; Zhao, Yuyang; Ma, Baorui; Pan, Liang; Yin, Wei; Kong, Lingdong; Ooi, Wei Tsang; Liu, Ziwei

Files

cgf70345.pdf (17.05 MB)

paper1201_mm1.pdf (33.51 MB)

Date

2026

Authors

Lu, Dongyue
Liang, Ao
Huang, Tianxin
Fu, Xiao
Zhao, Yuyang
Ma, Baorui
Pan, Liang
Yin, Wei
Kong, Lingdong
Ooi, Wei Tsang
Liu, Ziwei

Publisher

The Eurographics Association and John Wiley & Sons Ltd.

Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.

CCS Concepts: Computing methodologies → Reconstruction; Appearance and texture representations;

@article{10.1111:cgf.70345,
  journal = {Computer Graphics Forum},
  title = {{SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting}},
  author = {Lu, Dongyue and Liang, Ao and Huang, Tianxin and Fu, Xiao and Zhao, Yuyang and Ma, Baorui and Pan, Liang and Yin, Wei and Kong, Lingdong and Ooi, Wei Tsang and Liu, Ziwei},
  year = {2026},
  publisher = {The Eurographics Association and John Wiley \& Sons Ltd.},
  ISSN = {1467-8659},
  DOI = {10.1111/cgf.70345}
}

URI

https://diglib.eg.org/handle/10.1111/cgf70345
https://doi.org/10.1111/cgf.70345

Collections

45-Issue 2
EG 2026 - Full Papers - CGF 45-Issue 2
