Real-time 3D Human Body Pose Estimation from Monocular RGB Input
Human motion capture finds extensive application in movies, games, sports and biomechanical analysis. However, existing motion capture solutions require cumbersome external and/or on-body instrumentation, or use active sensors with limits on the possible capture volume dictated by power consumption. The ubiquity and ease of deployment of RGB cameras makes monocular RGB based human motion capture an extremely useful problem to solve, which would lower the barrier-to entry for content creators to employ motion capture tools, and enable newer applications of human motion capture. This thesis demonstrates the first real-time monocular RGB based motion-capture solutions that work in general scene settings. They are based on developing neural network based approaches to address the ill-posed problem of estimating 3D human pose from a single RGB image, in combination with model based fitting. In particular, the contributions of this work make advances towards three key aspects of real-time monocular RGB based motion capture, namely speed, accuracy, and the ability to work for general scenes. New training datasets are proposed, for single-person and multi-person scenarios, which, together with the proposed transfer learning based training pipeline, allow learning based approaches to be appearance invariant. The training datasets are accompanied by evaluation benchmarks with multiple avenues of fine-grained evaluation. The evaluation benchmarks differ visually from the training datasets, so as to promote efforts towards solutions that generalize to in-the-wild scenes. The proposed task formulations for the single-person and multi-person case allow higher accuracy, and incorporate additional qualities such as occlusion robustness, that are helpful in the context of a full motion capture solution. The multi-person formulations are designed to have a nearly constant inference time regardless of the number of subjects in the scene, and combined with contributions towards fast neural network inference, enable real-time 3D pose estimation for multiple subjects. Combining the proposed learning-based approaches with a model-based kinematic skeleton fitting step provides temporally stable joint angle estimates, which can be readily employed for driving virtual characters.