High-Performance Graphics 2010
Item: A Lazy Object-Space Shading Architecture With Decoupled Sampling (The Eurographics Association, 2010)
Authors: Burns, Christopher A.; Fatahalian, Kayvon; Mark, William R.
Editors: Doggett, Michael; Laine, Samuli; Hunt, Warren
Abstract: We modify the Reyes object-space shading approach to address two inefficiencies that result from performing shading calculations at micropolygon grid vertices prior to rasterization. First, our system samples shading of surface sub-patches uniformly in the object's parametric domain, but the location of shading samples need not correspond with the location of mesh vertices. Thus we perform object-space shading that efficiently supports motion and defocus blur without requiring micropolygons to achieve a shading rate of one sample per pixel. Second, our system resolves surface visibility prior to shading, then lazily shades 2x2 sample blocks that are known to contribute to the resulting fragments. We find that in comparison to a Reyes micropolygon rendering pipeline, decoupling geometric sampling rate from shading rate permits the use of meshes containing an order of magnitude fewer vertices with minimal loss of image quality in our test scenes. Shading on demand after rasterization reduces shader invocations by more than a factor of two in comparison to pre-visibility object-space shading.

Item: Analytical Motion Blur Rasterization with Compression (The Eurographics Association, 2010)
Authors: Gribel, Carl Johan; Doggett, Michael; Akenine-Möller, Tomas
Abstract: We present a rasterizer, based on time-dependent edge equations, that computes analytical visibility in order to render accurate motion blur. The theory for doing the computations in a rasterization framework is derived in detail, and then implemented. To keep the frame buffer requirements low, we also present a new oracle-based compression algorithm for the time intervals.
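As an illustration of the time-dependent edge equations the abstract above refers to, the sketch below tests a pixel sample against a linearly moving triangle. Dense time sampling stands in for the paper's analytic interval computation, and all names are hypothetical:

```python
# Toy illustration of time-dependent edge functions for motion blur.
# A dense time sampling approximates the covered shutter interval; the
# paper instead solves for the interval analytically.

def edge(ax, ay, bx, by, px, py):
    """2D edge function: positive if (px, py) lies to the left of a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def lerp(p0, p1, t):
    return tuple(a + (b - a) * t for a, b in zip(p0, p1))

def covered_fraction(tri0, tri1, px, py, steps=1000):
    """Fraction of the shutter interval [0, 1] during which sample
    (px, py) is inside the triangle moving linearly from tri0 to tri1."""
    hits = 0
    for i in range(steps):
        t = (i + 0.5) / steps
        a, b, c = (lerp(v0, v1, t) for v0, v1 in zip(tri0, tri1))
        e0 = edge(*a, *b, px, py)
        e1 = edge(*b, *c, px, py)
        e2 = edge(*c, *a, px, py)
        # inside if all edge functions agree in sign (either winding)
        if (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0):
            hits += 1
    return hits / steps

# A triangle sweeping rightward across the sample (5, 5):
tri0 = ((0, 0), (2, 10), (4, 0))    # positions at shutter open
tri1 = ((8, 0), (10, 10), (12, 0))  # positions at shutter close
print(covered_fraction(tri0, tri1, 5.0, 5.0))  # -> 0.25
```

With linear vertex motion each edge function is a low-degree polynomial in t, which is what makes the paper's exact interval arithmetic tractable.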
Our results are promising in that high-quality motion-blurred scenes can be rendered using a rasterizer with rather low memory requirements. Our resulting images contain motion blur for both opaque and transparent objects.

Item: Texture Compression of Light Maps using Smooth Profile Functions (The Eurographics Association, 2010)
Authors: Rasmusson, Jim; Ström, Jacob; Wennersten, Per; Doggett, Michael; Akenine-Möller, Tomas
Abstract: Light maps have long been a popular technique for visually rich real-time rendering in games. They typically contain smooth color gradients which current low-bit-rate texture compression techniques, such as DXT1 and ETC2, do not handle well. The application writer must therefore choose between doubling the bit rate by choosing a codec such as BC7, or accepting the compression artifacts, neither of which is desirable. The situation is aggravated by the recent popularity of radiosity normal maps, where three light maps plus a normal map are used for each surface. We present a new texture compression algorithm targeting smoothly varying textures, such as the light maps used in radiosity normal mapping. On high-resolution light map data from real games, the proposed method shows quality improvements of 0.7 dB in PSNR over ETC2, and 2.8 dB over DXT1, for the same bit rate. As a side effect, our codec can also compress many standard images (not light maps) with better quality than DXT1/ETC2.

Item: Large Data Visualization on Distributed Memory Multi-GPU Clusters (The Eurographics Association, 2010)
Authors: Fogal, Thomas; Childs, Hank; Shankar, Siddharth; Krüger, Jens; Bergeron, R. Daniel; Hatcher, Philip
Abstract: Data sets of immense size are regularly generated on large-scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace.
One solution to this problem is to employ a 'visualization cluster,' a small- to medium-scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger-scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory and, more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.

Item: Real-time Stochastic Rasterization on Conventional GPU Architectures (The Eurographics Association, 2010)
Authors: McGuire, Morgan; Enderton, Eric; Shirley, Peter; Luebke, David
Abstract: The paper presents a hybrid algorithm for rendering approximate motion and defocus blur with precise stochastic visibility evaluation. It demonstrates, for the first time with a full stochastic technique, real-time performance on conventional GPU architectures for complex scenes at 1920x1080 HD resolution. The algorithm operates on dynamic triangle meshes for which per-vertex velocity or corresponding vertices from the previous frame are available. It leverages multisample antialiasing (MSAA) and a tight space-time-aperture convex hull to efficiently evaluate visibility independently of shading.
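Stochastic rasterization of this kind gives every visibility sample its own shutter time and lens position. The sketch below generates such a per-pixel sample table, a generic construction under assumed conventions rather than the authors' exact scheme; all names are hypothetical:

```python
import random

def stochastic_sample_table(n, seed=42):
    """Build n visibility samples, each carrying a shutter time t
    stratified over [0, 1) and a lens position (u, v) on the unit disk
    (rejection sampled). Stratifying t keeps temporal noise low."""
    rng = random.Random(seed)
    strata = list(range(n))
    rng.shuffle(strata)  # decorrelate time stratum from sample index
    samples = []
    for i in range(n):
        t = (strata[i] + rng.random()) / n  # jittered within its stratum
        while True:  # rejection-sample the lens disk
            u = rng.random() * 2.0 - 1.0
            v = rng.random() * 2.0 - 1.0
            if u * u + v * v <= 1.0:
                break
        samples.append((t, u, v))
    return samples

samples = stochastic_sample_table(8)
```

Each sample then resolves visibility at its own (t, u, v), which is what lets shading be evaluated independently of, and less often than, visibility.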
For triangles that cross z

Item: Hardware Implementation of Micropolygon Rasterization with Motion and Defocus Blur (The Eurographics Association, 2010)
Authors: Brunhaver, John S.; Fatahalian, Kayvon; Hanrahan, Pat
Abstract: Current GPUs rasterize micropolygons (polygons approximately one pixel in size) inefficiently. Additionally, they do not natively support triangle rasterization with jittered sampling, defocus, or motion blur. We perform a microarchitectural study of fixed-function micropolygon rasterization using custom circuits. We present three rasterization designs: the first optimized for triangle micropolygons that are not blurred, a second for stochastic rasterization of micropolygons with motion and defocus blur, and a third that is a hybrid combination of the two. Our designs achieve high area and power efficiency by using low-precision operations and rasterizing pairs of adjacent triangles in parallel. We demonstrate optimized designs synthesized in a 45 nm process, showing that a micropolygon rasterization unit with a throughput of 3 billion micropolygons per second would consume 2.9 W and occupy 4.1 mm², which is 0.77 percent of the die area of a GeForce GTX 480 GPU.

Item: Architecture Considerations for Tracing Incoherent Rays (The Eurographics Association, 2010)
Authors: Aila, Timo; Karras, Tero
Abstract: This paper proposes a massively parallel hardware architecture for efficient tracing of incoherent rays, e.g. for global illumination. The general approach is centered around hierarchical treelet subdivision of the acceleration structure and repeated queueing/postponing of rays to reduce cache pressure. We describe a heuristic algorithm for determining the treelet subdivision, and show that our architecture can reduce the total memory bandwidth requirements by up to 90% in difficult scenes.
Furthermore, the architecture allows submitting rays in an arbitrary order with practically no performance penalty. We also conclude that scheduling algorithms can have an important effect on results, and that using fixed-size queues is not an appealing design choice. Increased auxiliary traffic, including traversal stacks, is identified as the foremost remaining challenge of this architecture.

Item: A Work-Efficient GPU Algorithm for Level Set Segmentation (The Eurographics Association, 2010)
Authors: Roberts, Mike; Packer, Jeff; Sousa, Mario Costa; Mitchell, Joseph Ross
Abstract: We present a novel GPU level set segmentation algorithm that is both work-efficient and step-efficient. Our algorithm: (1) has linear work-complexity and logarithmic step-complexity, both of which depend only on the size of the active computational domain and do not depend on the size of the level set field; (2) limits the active computational domain to the minimal set of changing elements by examining both the temporal and spatial derivatives of the level set field; (3) tracks the active computational domain at the granularity of individual level set field elements instead of tiles without performance penalty; and (4) employs a novel parallel method for removing duplicate elements from unsorted data streams in a constant number of steps. We apply our algorithm to 3D medical images and demonstrate that in typical clinical scenarios, our algorithm reduces the total number of processed level set field elements by 16x and is 14x faster than previous GPU algorithms with no reduction in segmentation accuracy.

Item: Parallel SAH k-D Tree Construction (The Eurographics Association, 2010)
Authors: Choi, Byn; Komuravelli, Rakesh; Lu, Victor; Sung, Hyojin; Bocchino, Robert L.; Adve, Sarita V.; Hart, John C.
Abstract: The k-D tree is a well-studied acceleration data structure for ray tracing.
It is used to organize primitives in a scene to allow efficient execution of intersection operations between rays and the primitives. The highest-quality k-D tree can be obtained using greedy cost optimization based on a surface area heuristic (SAH). While the high quality enables very fast ray tracing times, a key drawback is that the k-D tree construction time remains prohibitively expensive. This cost is unreasonable for rendering dynamic scenes for future visual computing applications on emerging multicore systems. Much work has therefore been focused on faster parallel k-D tree construction performance at the expense of approximating or ignoring SAH computation, which produces k-D trees that degrade rendering time. In this paper, we present two new parallel algorithms for building precise SAH-optimized k-D trees, with different tradeoffs between the total work done and parallel scalability. The algorithms achieve up to 8x speedup on 32 cores, without degrading tree quality and rendering time, yielding the best reported speedups so far for precise-SAH k-D tree construction.

Item: AnySL: Efficient and Portable Shading for Ray Tracing (The Eurographics Association, 2010)
Authors: Karrenberg, Ralf; Rubinstein, Dmitri; Slusallek, Philipp; Hack, Sebastian
Abstract: While a number of different shading languages have been developed, their efficient integration into an existing renderer is notoriously difficult, often boiling down to implementing an entire compiler toolchain for each language. Furthermore, no shading language is broadly supported across the variety of rendering systems. AnySL attacks this issue from multiple directions: we compile shaders from different languages into a common, portable representation, which uses subroutine-threaded code: every language operator is translated to a function call. Thus, the compiled shader is generic with respect to the types and operators used.
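The idea of a shader that is generic over a renderer's types and operators can be shown in miniature: the shader body calls only abstract operations, and an instantiation supplies concrete ones. This is a loose analogy in Python, not AnySL's compiler-based mechanism, and all names are hypothetical:

```python
# A toy analogue of a type-generic shader: every operation goes
# through an `ops` table, so the same body works for any instantiation.

def diffuse_shader(ops, normal, light_dir, albedo):
    """Generic Lambert shader written only against abstract operators."""
    ndotl = ops["max"](ops["dot"](normal, light_dir), ops["zero"])
    return ops["scale"](albedo, ndotl)

# Scalar instantiation: vectors are 3-tuples of Python floats.
scalar_ops = {
    "dot":   lambda a, b: sum(x * y for x, y in zip(a, b)),
    "max":   max,
    "zero":  0.0,
    "scale": lambda c, s: tuple(x * s for x in c),
}

print(diffuse_shader(scalar_ops, (0.0, 0.0, 1.0),
                     (0.0, 0.0, 1.0), (1.0, 0.5, 0.25)))
# -> (1.0, 0.5, 0.25)
```

A SIMD renderer could supply array-valued operators to the same shader body unchanged, which is roughly the portability AnySL achieves, except that its embedded compiler inlines the concrete operations and removes the call overhead entirely.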
The key component of our system is an embedded compiler that instantiates this generic code in terms of the renderer's native types and operations. It allows for flexible code transformations to match the internal structure of the renderer and eliminates all overhead due to the subroutine-threaded code. For SIMD architectures, we automatically perform vectorization of scalar shaders, which speeds up rendering by a factor of 3.9 on average on SSE. The results are highly optimized, parallel shaders that operate directly on the internal data structures of a renderer. We show that both traditional shading languages such as RenderMan and C/C++-based shading languages can be fully supported and deliver high performance across different CPU renderers.

Item: Edge-Avoiding À-Trous Wavelet Transform for fast Global Illumination Filtering (The Eurographics Association, 2010)
Authors: Dammertz, Holger; Sewtz, Daniel; Hanika, Johannes; Lensch, Hendrik P. A.
Abstract: We present a fast and simple filtering method designed for ray-traced Monte Carlo global illumination images which achieves real-time rates. Even on modern hardware, only a few samples can be traced for interactive applications, resulting in very noisy outputs. Taking advantage of the fact that Monte Carlo computes hemispherical integrals that may be very similar for neighboring pixels, we derive a fast edge-avoiding filtering method in screen space using the À-Trous wavelet transform that operates on the full noisy image and produces a result that is close to a solution with many more samples per pixel.

Item: Real Time Volumetric Shadows using Polygonal Light Volumes (The Eurographics Association, 2010)
Authors: Billeter, Markus; Sintorn, Erik; Assarsson, Ulf
Abstract: This paper presents a more efficient way of computing single scattering effects in homogeneous participating media for real-time purposes than the currently popular ray-marching based algorithms.
These effects include halos around light sources, volumetric shadows, and crepuscular rays. By displacing the vertices of a base mesh with the depths from a standard shadow map, we construct a polygonal mesh that encloses the volume of space that is directly illuminated by a light source. Using this volume, we can calculate the airlight contribution for each pixel by considering only points along the eye-ray where shadow transitions occur. Unlike previous ray-marching methods, our method calculates the exact airlight contribution, with respect to the shadow map resolution, at real-time frame rates.

Item: GPU Random Numbers via the Tiny Encryption Algorithm (The Eurographics Association, 2010)
Authors: Zafar, Fahad; Olano, Marc; Curtis, Aaron
Abstract: Random numbers are extensively used on the GPU. As more computation is ported to the GPU, it can no longer be treated as rendering hardware alone. Random number generators (RNGs) are expected to cater to general-purpose and graphics applications alike. Such diversity adds to the expected requirements of an RNG. A good GPU RNG should be able to provide repeatability, random access, multiple independent streams, speed, and random numbers free from detectable statistical bias. A specific application may require some if not all of the above characteristics at one time. In particular, we hypothesize that not all algorithms need the highest-quality random numbers, so a good GPU RNG should provide a speed-quality tradeoff that can be tuned for fast low-quality or slower high-quality random numbers. We propose that the Tiny Encryption Algorithm satisfies all of the requirements of a good GPU pseudo-random number generator. We compare our technique against previous approaches, and present an evaluation using standard randomness test suites as well as Perlin noise and a Monte Carlo shadow algorithm.
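TEA itself is compact enough to sketch directly: it mixes a two-word block with a four-word key, and a counter-based generator in this style simply "encrypts" the (pixel, sample) indices. The round count is the speed-quality knob the abstract describes; the helper names below are hypothetical:

```python
def tea(v0, v1, key=(0, 0, 0, 0), rounds=8):
    """Tiny Encryption Algorithm on two 32-bit words.
    Fewer rounds than the cipher's full 32 trade quality for speed."""
    delta, s, mask = 0x9E3779B9, 0, 0xFFFFFFFF
    for _ in range(rounds):
        s = (s + delta) & mask
        v0 = (v0 + ((((v1 << 4) + key[0]) & mask)
                    ^ ((v1 + s) & mask)
                    ^ (((v1 >> 5) + key[1]) & mask))) & mask
        v1 = (v1 + ((((v0 << 4) + key[2]) & mask)
                    ^ ((v0 + s) & mask)
                    ^ (((v0 >> 5) + key[3]) & mask))) & mask
    return v0, v1

def gpu_style_random(pixel_index, sample_index):
    """Counter-based random float in [0, 1): hash the (pixel, sample)
    pair, giving repeatability and random access with no stored state."""
    v0, _ = tea(pixel_index & 0xFFFFFFFF, sample_index & 0xFFFFFFFF)
    return v0 / 2**32
```

Because the output depends only on the indices, every GPU thread can draw its numbers independently and reproducibly, with no per-thread generator state to store or synchronize.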
We show that the quality of random number generation directly affects the quality of the noise produced; however, good-quality noise can still be produced with a lower-quality random number generator.

Item: HLBVH: Hierarchical LBVH Construction for Real-Time Ray Tracing of Dynamic Geometry (The Eurographics Association, 2010)
Authors: Pantaleoni, Jacopo; Luebke, David
Abstract: We present HLBVH and SAH-optimized HLBVH, two high-performance BVH construction algorithms targeting real-time ray tracing of dynamic geometry. HLBVH provides a novel hierarchical formulation of the LBVH algorithm [LGS-09], and SAH-optimized HLBVH uses a new combination of HLBVH and the greedy surface area heuristic algorithm. These algorithms minimize work and memory bandwidth usage by extracting and exploiting coarse-grained spatial coherence already available in the input meshes. As such, they are well suited for sorting dynamic geometry, in which the mesh to be sorted at a given time step can be defined as a transformation of a mesh that has already been sorted at the previous time step. Our algorithms always perform full resorting, unlike previous approaches based on refitting. As a result, they remain efficient even during chaotic and discontinuous transformations, such as fracture or explosion.

Item: Ambient Occlusion Volumes (The Eurographics Association, 2010)
Authors: McGuire, Morgan
Abstract: This paper introduces a new approximation algorithm for the near-field ambient occlusion problem. It combines known pieces in a new way to achieve substantially improved quality over fast methods and substantially improved performance compared to accurate methods. Intuitively, it computes the analog of a shadow volume for ambient light around each polygon, and then applies a tunable occlusion function within the region it encloses.
The algorithm operates on dynamic triangle meshes and produces output that is comparable to ray-traced occlusion for many scenes. The algorithm's performance on modern GPUs is largely independent of geometric complexity and is dominated by fill rate, as is the case with most deferred shading algorithms.

Item: Space-Time Hierarchical Occlusion Culling for Micropolygon Rendering with Motion Blur (The Eurographics Association, 2010)
Authors: Boulos, Solomon; Luong, Edward; Fatahalian, Kayvon; Moreton, Henry; Hanrahan, Pat
Abstract: Occlusion culling using a traditional hierarchical depth buffer, or z-pyramid, is less effective when rendering with motion blur. We present a new data structure, the tz-pyramid, that extends the traditional z-pyramid to represent scene depth values in time. This temporal information improves culling efficacy when rendering with motion blur. The tz-pyramid allows occlusion culling to adapt to the amount of scene motion, providing a balance of high efficacy with large motion and low cost in terms of depth comparisons when motion is small. Compared to a traditional z-pyramid, using the tz-pyramid for occlusion culling reduces the number of micropolygons shaded by up to 3.5x. In addition to better culling, the tz-pyramid reduces the number of depth comparisons by up to 1.4x.

Item: Efficient Bounding of Displaced Bézier Patches (The Eurographics Association, 2010)
Authors: Munkberg, Jacob; Hasselgren, Jon; Toth, Robert; Akenine-Möller, Tomas
Abstract: In this paper, we present a new approach to conservative bounding of displaced Bézier patches. These surfaces are expected to be a common use case for tessellation in interactive and real-time rendering. Our algorithm combines efficient normal bounding techniques, min-max mipmap hierarchies, and oriented bounding boxes.
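Of the ingredients just listed, the min-max mipmap hierarchy is the easiest to sketch: each coarser level stores the min and max displacement of its four children, so any level gives conservative displacement bounds for a whole region of the map. This is a generic construction, not the authors' implementation, and all names are hypothetical:

```python
def minmax_mipmap(disp):
    """Build a min-max pyramid over a square power-of-two displacement
    map. levels[0] holds (d, d) per texel; each coarser texel stores the
    (min, max) of its four children, giving conservative bounds."""
    levels = [[[(d, d) for d in row] for row in disp]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        n = len(prev) // 2
        nxt = []
        for y in range(n):
            row = []
            for x in range(n):
                kids = [prev[2 * y][2 * x],     prev[2 * y][2 * x + 1],
                        prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1]]
                row.append((min(k[0] for k in kids),
                            max(k[1] for k in kids)))
            nxt.append(row)
        levels.append(nxt)
    return levels

disp = [[0.0, 0.1, 0.2, 0.3],
        [0.1, 0.9, 0.2, 0.1],
        [0.0, 0.0, 0.5, 0.4],
        [0.2, 0.1, 0.3, 0.6]]
pyr = minmax_mipmap(disp)
print(pyr[-1][0][0])  # -> (0.0, 0.9), conservative bounds for the whole map
```

A bounding routine can then inflate a patch's oriented bounding box along the bounded normals by the (min, max) displacement of the pyramid level covering the patch's UV range, which is the kind of conservative bound the abstract describes.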
This results in substantially faster convergence for the bounding volumes of displaced surfaces, prior to tessellation and displacement shading. Our work can be used for different types of culling, ray tracing, and sorting higher-order primitives in tiling architectures. For our hull shader implementation, we report performance benefits even for moderate tessellation rates.

Item: Restart Trail for Stackless BVH Traversal (The Eurographics Association, 2010)
Authors: Laine, Samuli
Abstract: A ray cast algorithm utilizing a hierarchical acceleration structure needs to perform a tree traversal in the hierarchy. In its basic form, executing the traversal requires a stack that holds the nodes that are still to be processed. In some cases, such a stack can be prohibitively expensive to maintain or access, due to storage or memory bandwidth limitations. The stack can, however, be eliminated or replaced with a fixed-size buffer using so-called stackless or short-stack algorithms. These require that the traversal can be restarted from the root so that the already processed part of the tree is not entered again. For kd-tree ray casts, this is accomplished easily by ray shortening, but the approach does not extend to other kinds of hierarchies such as BVHs. In this paper, we introduce the restart trail, a simple algorithmic method that makes restarts possible regardless of the type of hierarchy by storing one bit of data per level. This enables stackless and short-stack traversal for BVH ray casts, where using a full stack or constraining the traversal order have so far been the only options.

Item: Task Management for Irregular-Parallel Workloads on the GPU (The Eurographics Association, 2010)
Authors: Tzeng, Stanley; Patney, Anjul; Owens, John D.
Abstract: We explore software mechanisms for managing irregular tasks on graphics processing units (GPUs).
We demonstrate that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads. We experiment with several task-management techniques, ranging from the use of a single monolithic task queue to distributed queuing with task stealing and donation. On irregular workloads, we show that both centralized and distributed queues incur more than 100 times as much idle time as our task-stealing and -donation queues. We prefer task donation because it performs comparably to task stealing while using less memory. To help in this analysis, we use an artificial task-management system that monitors performance and memory usage to quantify the impact of these different techniques. We validate our results by implementing a Reyes renderer with its irregular split-and-dice workload that is able to achieve real-time frame rates on a single GPU.
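The queueing strategies compared above can be modeled in a single-threaded toy simulation: idle workers steal from the head of the fullest queue, and overflowing workers donate their excess to the emptiest queue. This is a loose model for illustration, not the authors' GPU implementation, and all names are hypothetical:

```python
from collections import deque

def run_workers(tasks, n_workers=4, capacity=8):
    """Simulate distributed per-worker queues with task stealing and
    donation. Each task is a callable returning a list of child tasks,
    modeling an irregular (split-and-dice style) workload."""
    queues = [deque() for _ in range(n_workers)]
    for i, task in enumerate(tasks):
        queues[i % n_workers].append(task)
    executed = steals = donations = 0
    while any(queues):
        for q in queues:
            if not q:  # idle worker: steal from the head of the fullest queue
                victim = max(queues, key=len)
                if victim:
                    q.append(victim.popleft())
                    steals += 1
            if q:
                task = q.pop()  # work from own tail
                executed += 1
                for child in task():  # task may spawn children
                    q.append(child)
                while len(q) > capacity:  # donate overflow to emptiest queue
                    target = min((p for p in queues if p is not q), key=len)
                    target.append(q.pop())
                    donations += 1
    return executed, steals, donations

def leaf():
    return []

def spawner():  # one task that suddenly spawns 30 children
    return [leaf] * 30

executed, steals, donations = run_workers([leaf] * 20 + [spawner])
print(executed)  # -> 51 (20 roots + 1 spawner + 30 children)
```

Donation bounds each queue's length at the moment work is produced, which is the memory advantage over pure stealing that the abstract reports.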