Eurographics Digital Library
This is the DSpace 7 platform of the Eurographics Digital Library.
- The contents of the Eurographics Digital Library Archive are freely accessible. Only access to the full-text documents of the journal Computer Graphics Forum (joint property of Wiley and Eurographics) is restricted to Eurographics members, people from institutions who have an Institutional Membership at Eurographics, or users of the TIB Hannover. On the item pages you will find so-called purchase links to the TIB Hannover.
- As a Eurographics member, you can log in with your email address and password from https://services.eg.org. If you are part of an institutional member and you are on a computer with a Eurographics registered IP domain, you can proceed immediately.
- From 2022, all new releases published by Eurographics will be licensed under Creative Commons. Publishing with Eurographics is Plan-S compliant. Please visit Eurographics Licensing and Open Access Policy for more details.
Recent Submissions
Denoising Monte Carlo Renderings: a Sub-Pixel Exploration with Deep Learning
(ETH Zurich, 2024-05-29) Zhang, Xianyao
Monte Carlo rendering techniques, exemplified by path tracing, are able to faithfully capture the interaction between light and objects. Therefore, they have become the primary means for visual effects and animation films to render digital assets into frames. However, Monte Carlo rendering techniques require stochastic sampling within each pixel to estimate the pixel color, leading to slow convergence and a choice between high rendering cost and noisy images.
Fortunately, the similarity of neighboring noisy pixels can be exploited to create much cleaner images. Such denoising techniques reduce the rendering budget significantly without sacrificing quality, and are crucial for the application of Monte Carlo rendering to production. One particular reason for the success of Monte Carlo denoisers, and their biggest difference from natural image denoisers, is the flexibility to use sub-pixel information. That is, based on the scene and the application scenario, the renderer can be instructed to output more data than simply the noisy pixel color. The data can contain estimates of a part of light transport or describe properties of the underlying scene. Such additional information can guide the denoiser to better preserve details, remove noise, or serve downstream workflows.
In this thesis, we design denoising algorithms for Monte Carlo renderings by applying deep learning techniques in the sub-pixel domain, in the aspects of light transport decomposition, auxiliary feature buffers, and intra-pixel depth separation. Our work mainly targets high-quality offline renderings, and we validate the effectiveness of the methods on both academic and production-quality datasets. First, inspired by user-defined decomposition such as diffuse–specular, we propose to prepend a learned decomposition module to the denoiser, where the learned decomposition typically produces images that are easier to denoise. Results show that this architecture outperforms an end-to-end denoiser with a similar number of trainable parameters, achieving significant rendering cost reduction to reach equal quality. Second, the power of auxiliary feature buffers for denoising prompts us to explore the appropriate feature sets for denoising volumetric effects. Our training–selection–retraining workflow sifts useful features from a large pool of candidates at a relatively low cost. Feature sets produced by this workflow improve denoising quality for denoisers with different architectures on a variety of volumetric effects. Finally, depth separation within each pixel underlies the deep-Z format, which is useful for compositing but lacks an effective denoiser that preserves the depth structure. We propose a neural denoiser for deep-Z images based on 3-D convolutional neural networks, which can effectively remove noise at different depth levels, greatly reducing the rendering cost for deep compositing workflows.
The main contributions of the thesis are as follows. For one, we propose novel denoising techniques based on deep learning in the sub-pixel domain or using sub-pixel information, and experimentally show that they advance the state of the art of denoising Monte Carlo renderings. For another, we demonstrate the benefit of using specialized sub-pixel information for more specific types of rendering, such as volumetric effects. Lastly, we show the possibility of generalizing 2-D deep-learning denoising techniques to deep-Z images while preserving the sub-pixel depth structure.
Robust Deep Learning-based Methods for Non-Rigid Shape Correspondence
(Ecole Polytechnique, 2024-10-10) Attaiki, Souhaib
The automatic processing and analysis of 3D shapes is a critical area of research with significant implications for fields such as medical imaging, virtual reality, and computer graphics. A primary challenge in this domain is the efficient comparison of non-rigid shapes, which involves establishing correspondences between surfaces undergoing complex deformations. This dissertation advances the state of the art in non-rigid shape matching by leveraging deep learning within the functional map framework. Previous deep functional map methods struggle with partial shapes, decoding the information in their probe functions, and utilizing large-scale datasets for pretraining, among other issues. To overcome these obstacles, our work contributes five significant advancements to the field of deep functional maps. First, we introduce Deep Partial Functional Maps (DPFM), a novel architecture that enhances communication between source and target shapes and is particularly adept at handling partial shapes with non-rigid deformations. Second, we present the Neural Correspondence Prior (NCP), which employs neural networks as a prior to propose a general, unsupervised method for shape matching, especially suitable for sparse and non-isometric data. Additionally, we analyze the features learned through deep functional maps and suggest straightforward modifications to the pipeline that extend the utility of these features beyond their traditional roles. Furthermore, we tackle the challenge of input feature robustness by exploring the pre-training of generalizable local features on large datasets of rigid shapes, thus boosting performance on non-rigid shape analysis tasks. We also introduce a zero-shot method for non-rigid shape matching that operates independently of any pretraining steps or datasets.
Together, these innovations provide robust and efficient solutions for non-rigid shape matching, addressing long-standing challenges and broadening the application of these methods to diverse real-world datasets and applications.
Situated Visualization in Motion
(Université Paris-Saclay, 2023-12-18) Yao, Lijie
In my thesis, I define visualization in motion and make several contributions to how to visualize and design situated visualizations in motion. In situated data visualization, data is visualized directly near its data referent, i.e., the physical space, object, or person it refers to. Situated visualizations are often useful in contexts where the data referent or the viewer does not remain stationary but is in relative motion. For example, a runner may look at visualizations on their fitness band while running or on a public display as they pass by it. Reading visualizations in such scenarios might be impacted by motion factors. Understanding how to best design visualizations for dynamic contexts is therefore important. That is, effective and visually stable situated data encodings need to be defined and studied when motion factors are involved. I first define visualization in motion as visual data representations used in contexts that exhibit relative motion between a viewer and an entire visualization. I classify visualization in motion into 3 categories: (a) moving viewer & stationary visualization, (b) moving visualization & stationary viewer, and (c) moving viewer & moving visualization. To analyze the opportunities and challenges of designing visualization in motion, I propose a research agenda. To explore to what extent viewers can accurately read visualization in motion, I conduct a series of empirical perception studies on magnitude proportion estimation. My results show that people can get reliable information from visualization in motion, even at high speed and under irregular trajectories. Based on my perception results, I move toward answering the question of how to design and embed visualization in motion in real contexts. I choose swimming as an application scenario because swimming has rich, dynamic data. I implement a technology probe that allows users to embed visualizations in motion in a live swimming video.
Users can adjust the visual encoding parameters, the movement status, and the situatedness of the visualization in real time. The visualizations encode real swimming race-related data. My evaluation with designers confirms that designing visualizations in motion requires more than what traditional visualization toolkits provide: the visualization needs to be placed in context (e.g., its data referent, its background) but also needs to be previewed under its real movement. The full context with motion effects can affect design decisions. After that, I continue my work to understand the impact of the context on the design of visualizations in motion and their user experience. I select video games as my test platform, in which visualizations in motion are placed in a busy, dynamic background but need to help players make quick decisions to win. My study shows there are trade-offs between a visualization's readability under motion and its aesthetics. Participants seek a balance between the readability of the visualization, its aesthetic fit with the context, the immersive experience the visualization brings, the support the visualization can provide for a win, and the harmony between the visualization and its context.
Learning Digital Humans from Vision and Language
(ETH Zurich, 2024-10-10) Feng, Yao
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is driven by the importance of understanding humans themselves and the pivotal role digital humans play in diverse applications, including virtual presence in AR/VR, digital fashion, entertainment, robotics, and healthcare. However, two major challenges hinder the widespread use of digital humans across disciplines: the difficulty of capturing them, as current methods rely on complex systems that are time-consuming, labor-intensive, and costly; and the lack of understanding, where even after digital humans are created, gaps in understanding their 3D representations and integrating them with broader world knowledge limit their effective utilization. Overcoming these challenges is crucial to unlocking the full potential of digital humans in interdisciplinary research and practical applications.
To address these challenges, this thesis combines insights from computer vision, computer graphics, and machine learning to develop scalable methods for capturing and modeling digital humans. These methods include capturing faces, bodies, hands, hair, and clothing using accessible data such as images, videos, and text descriptions. More importantly, we go beyond capturing to shift the research paradigm toward understanding and reasoning by leveraging large language models (LLMs). For instance, we developed the first foundation model that not only captures 3D human poses from a single image, but also reasons about a person’s potential next actions in 3D by incorporating world knowledge. This thesis unifies scalable capturing and understanding of digital humans from vision and language data, just as humans do by observing and interpreting the world through visual and linguistic information.
Our research begins by developing a framework to capture detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby allowing animation of estimated faces with various expressions.
Since humans are not just faces, we then develop PIXIE, a method for estimating animatable, whole-body 3D avatars with realistic facial details from a single image. By incorporating an attention mechanism, PIXIE surpasses previous methods in accuracy and enables the creation of expressive, high-quality 3D humans.
Expanding beyond human bodies, we proposed SCARF and DELTA to capture separate body, clothing, face, and hair from monocular videos using a hybrid representation. While clothing and hair are better modeled with implicit representations like neural radiance fields (NeRFs) due to their complex topologies, human bodies are better represented with meshes. SCARF combines the strengths of both by integrating mesh-based bodies with NeRFs for clothing and hair. To enable learning directly from monocular videos, we introduced mesh-integrated volume rendering, which allows optimizing the model directly from 2D image data without requiring 3D supervision. Thanks to the disentangled modeling, the captured avatar's clothing can be transferred to arbitrary body shapes, making it especially valuable for applications such as virtual try-on. Building on SCARF's hybrid representation, we introduced TECA, which uses text-to-image generation models to create realistic and editable 3D avatars. TECA produces more realistic avatars than recent methods while allowing edits due to its compositional design. For instance, users can input descriptions like "a slim woman with dreadlocks" to generate a 3D head mesh with texture and a NeRF model for the hair. It also enables transferring NeRF-based hairstyles, scarves, and other accessories between avatars. While these methods make capturing humans more accessible, broader applications require understanding the context of human behavior. Traditional pose estimation methods often isolate subjects by cropping images, which limits their ability to interpret the full scene or reason about actions.
To address this, we developed ChatPose, the first model for understanding and reasoning about 3D human poses. ChatPose leverages a multimodal large language model (LLM) with a finetuned projection layer that decodes embeddings into 3D pose parameters, which are further decoded into 3D body meshes using the SMPL body model. By finetuning on both text-to-3D pose and image-to-3D pose data, ChatPose demonstrates, for the first time, that an LLM can directly reason about 3D human poses. This capability allows ChatPose to describe human behavior, generate 3D poses, and reason about potential next actions in 3D form, combining perception with reasoning.
We believe the contributions of this thesis, in scaling up digital human capture and advancing the understanding of humans in 3D, have the potential to shape the future of human-centered research and enable broader applications across diverse fields.
Interaction in Virtual Reality Simulations
(Politecnico di Torino, 2023-07-18) Calandra, Davide
Virtual Reality (VR) has emerged as a powerful technology for creating immersive and engaging simulations that enable users to interact with computer-generated environments in a natural and intuitive way. However, the design and implementation of effective interaction methods in VR remain challenging. The lack of proper haptic feedback, and the need to rely on input devices such as controllers or gestures, for example, can result in awkward or unnatural interactions, reducing the perceived level of realism and the immersion related to the VR experience. At the same time, the employment of poorly designed interaction paradigms may impair usability, reduce the sense of presence, and even cause unpleasant effects related to so-called cybersickness.
This doctoral thesis, which covers a subset of the research work performed in the three-year Ph.D. period, aims to address these challenges by investigating the role of interaction in VR simulations. The investigated topics range from the study of locomotion interfaces in VR, to the use of haptic interfaces for simulating passive and haptic tools applied to real-life training use cases, and the exploration of further forms of Human-Computer Interaction (HCI) and Human-Human Interaction (HHI) through voice and body gestures, also in the context of multi-user shared simulations.
Results obtained in the considered case studies cover a wide range of relevant aspects, such as realism, usability, and engagement of VR simulations, among others, ultimately leading to a validation of the proposed approaches and methodologies. In this way, the thesis contributes to the understanding of how to design and evaluate interaction paradigms in VR simulations in order to enhance aspects related to User eXperience (UX), with the goal of letting users successfully achieve the intended simulation objectives.