Filtering and Optimization Strategies for Markerless Human Motion Capture with Skeleton-Based Shape Models
For more than 2000 years, people have been interested in understanding and analyzing the movements of animals and humans, which has led to the development of advanced computer systems for motion capture. Although marker-based systems for motion analysis are commercially successful, capturing the performance of a human or an animal from a multi-view video sequence without the need for markers is still a challenging task. The most popular methods for markerless human motion capture are model-based approaches that rely on a surface model of the human with an underlying skeleton. In this context, markerless motion capture seeks the pose, i.e., the position, orientation, and configuration of the human skeleton, that best explains the image data. To address this problem, we discuss two questions:
1. What are good cues for human motion capture? Typical cues for motion capture are silhouettes, edges, color, motion, and texture. In general, multi-cue integration is necessary for tracking complex objects like humans, since all these cues come with inherent drawbacks. Besides the selection of the cues to be combined, reasonable information fusion is a common challenge in many computer vision tasks. Ideally, the impact of a cue should be large when its extraction is reliable and small when the information is likely to be erroneous. To this end, we propose an adaptive weighting scheme that combines complementary cues, namely silhouettes on the one hand and optical flow as well as local descriptors on the other. Whereas silhouette extraction works best for homogeneous objects, optical flow computation and local descriptors perform better on sufficiently structured objects.
Besides image-based cues, we also propose a statistical prior on anatomical constraints that is independent of motion patterns. Relying only on image features that are tracked over time does not prevent the accumulation of small errors, which results in drift away from the target object. The error accumulation becomes even more problematic in the case of multiple moving objects due to occlusions. To solve the drift problem for tracking, we propose an analysis-by-synthesis framework that uses reference images to correct the pose. It includes occlusion handling and is successfully applied to crash test video analysis.
2. Is human motion capture a filtering or an optimization problem? Model-based human motion capture can be regarded as a filtering or an optimization problem. While local optimization offers accurate estimates but often loses track due to local optima, particle filtering can recover from errors at the expense of poor accuracy due to overestimation of noise. In order to overcome the drawbacks of local optimization, we introduce a novel global stochastic optimization approach for markerless human motion capture that is derived from the mathematical theory of interacting particle systems. We call the method interacting simulated annealing (ISA) since it is based on an interacting particle system that converges to the global optimum, similar to simulated annealing. It estimates the human pose without initial information, which is a challenging optimization problem in a high-dimensional space. Furthermore, we propose a tracking framework that is based on this optimization technique to achieve both the robustness of filtering strategies and remarkable accuracy.
In order to benefit from optimization and filtering, we introduce a multi-layer framework that combines stochastic optimization, filtering, and local optimization.
While the first layer relies on interacting simulated annealing, the second layer refines the estimates by filtering and local optimization such that the accuracy is increased and ambiguities are resolved over time without imposing restrictions on the dynamics.
In addition, we propose a system that recovers not only the movement of the skeleton, but also the possibly non-rigid temporal deformation of the 3D surface. While large-scale deformations or fast movements are captured by the skeleton pose and approximate surface skinning, true small-scale deformations or non-rigid garment motion are captured by fitting the surface to the silhouette. In order to make automatic processing of large data sets feasible, the skeleton-based pose estimation is split into a local one and a lower-dimensional global one by exploiting the tree structure of the skeleton.
Our experiments comprise a large variety of sequences for qualitative and quantitative evaluation of the proposed methods, including a comparison of global stochastic optimization with several other optimization and particle filtering approaches.
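The idea behind interacting simulated annealing can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the function name, all parameters, the geometric annealing and noise schedules, and the toy energy are assumptions made for illustration only. The sketch shows the three steps that such a particle-based annealing scheme typically repeats: weighting particles by a Boltzmann-type factor of their energy, selecting (resampling) particles according to these weights, and mutating the survivors with noise that shrinks over time. In motion capture, the energy would measure the mismatch between the projected model and the image cues, and each particle would be a pose vector.

```python
import math
import random


def interacting_annealing(energy, dim, n_particles=200, n_iters=40,
                          beta0=1.0, beta_growth=1.3,
                          sigma0=1.0, sigma_decay=0.9,
                          bounds=(-5.0, 5.0), seed=0):
    """Toy particle-based annealing (hypothetical names and schedules).

    Returns the lowest-energy particle evaluated and its energy.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    # No initial pose is assumed: particles start uniformly in the search box.
    particles = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_particles)]
    beta, sigma = beta0, sigma0
    best, best_e = None, math.inf

    for _ in range(n_iters):
        energies = [energy(p) for p in particles]
        i_min = min(range(n_particles), key=energies.__getitem__)
        e_min = energies[i_min]
        if e_min < best_e:
            best_e, best = e_min, list(particles[i_min])

        # Weighting: Boltzmann-type weights, shifted by the minimum energy
        # so that the largest weight is exactly 1 (numerical stability).
        weights = [math.exp(-beta * (e - e_min)) for e in energies]

        # Selection: resample particles in proportion to their weights.
        selected = rng.choices(particles, weights=weights, k=n_particles)

        # Mutation: perturb each selected particle with Gaussian noise.
        particles = [[x + rng.gauss(0.0, sigma) for x in p] for p in selected]

        # Annealing: sharpen the weighting and shrink the noise over time.
        beta *= beta_growth
        sigma *= sigma_decay

    return best, best_e


def toy_energy(x):
    """Stand-in for an image-based pose energy; minimum 0 at (1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


if __name__ == "__main__":
    pose, value = interacting_annealing(toy_energy, dim=2)
    print(pose, value)
```

The resampling step is what makes the particles interact: low-energy particles are duplicated and high-energy ones are dropped, so the population concentrates around promising regions while the decreasing noise level refines the estimate.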