Search Results
Item Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication (The Eurographics Association, 2004) Fatahalian, K.; Sugerman, J.; Hanrahan, P.; Tomas Akenine-Moeller and Michael McCool
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.

Item A Programmable Vertex Shader with Fixed-Point SIMD Datapath for Low Power Wireless Applications (The Eurographics Association, 2004) Sohn, Ju-Ho; Woo, Ramchan; Yoo, Hoi-Jun; Tomas Akenine-Moeller and Michael McCool
Real-time 3D graphics has become one of the attractive applications for 3G wireless terminals, although their battery lifetime and memory bandwidth limit the system resources for graphics processing. Instead of using a dedicated hardware engine with complex functions, we propose an efficient hardware architecture of a low power vertex shader with programmability.
Our architecture includes the following three features: I) a fixed-point SIMD datapath to exploit parallelism in vertex processing while keeping the power consumption low, II) a multithreaded coprocessor interface to decrease unwanted stalls between the main processor and the vertex shader, reducing power consumption by instruction-level power management, III) a programmable vertex engine to increase the datapath throughput by concurrent operations with the main processor. Simulation results show that the full 3D geometry pipeline can be performed at 7.2M vertices/sec with 115mW power consumption for polygons using the OpenGL lighting model. The improvement is about 10 times greater than that of the latest graphics core with floating-point datapath for wireless applications in terms of processing speed normalized by power consumption, Kvertices/sec per milliwatt.

Item Hardware-based Simulation and Collision Detection for Large Particle Systems (The Eurographics Association, 2004) Kolb, A.; Latta, L.; Rezk-Salama, C.; Tomas Akenine-Moeller and Michael McCool
Particle systems have long been recognized as an essential building block for detail-rich and lively visual environments. Current implementations can handle up to 10,000 particles in real-time simulations and are mostly limited by the transfer of particle data from the main processor to the graphics hardware (GPU) for rendering. This paper introduces a full GPU implementation using fragment shaders of both the simulation and rendering of a dynamically-growing particle system. Such an implementation can render up to 1 million particles in real-time on recent hardware. The massively parallel simulation handles collision detection and reaction of particles with objects of arbitrary shape. The collision detection is based on depth maps that represent the outer shape of an object. The depth maps store distance values and normal vectors for collision reaction.
Using a special texture-based indexing technique to represent normal vectors, standard 8-bit textures can be used to describe the complete depth map data. Alternately, several depth maps can be stored in one floating point texture. In addition, a GPU-based parallel sorting algorithm is introduced that can be used to perform a depth sorting of the particles for correct alpha blending.

Item A Flexible Simulation Framework for Graphics Architectures (The Eurographics Association, 2004) Sheaffer, J. W.; Luebke, D.; Skadron, K.; Tomas Akenine-Moeller and Michael McCool
In this paper we describe a multipurpose tool for analysis of the performance characteristics of computer graphics hardware and software. We are developing Qsilver, a highly configurable micro-architectural simulator of the GPU that uses the Chromium system's ability to intercept and redirect an OpenGL stream. The simulator produces an annotated trace of graphics commands using Chromium, then runs the trace through a cycle-timer model to evaluate time-dependent behaviors of the various functional units. We demonstrate the use of Qsilver on a simple hypothetical architecture to analyze performance bottlenecks, to explore new GPU microarchitectures, and to model power and leakage properties. One innovation we explore is the use of dynamic voltage scaling across multiple clock domains to achieve significant energy savings at almost negligible performance cost. Finally, we discuss how other architectural features and experiments might be incorporated into the Qsilver framework.

Item PixelView: A View-Independent Graphics Rendering Architecture (The Eurographics Association, 2004) Stewart, J.; Bennett, E.P.; McMillan, L.; Tomas Akenine-Moeller and Michael McCool
We present a new computer graphics rendering architecture that allows all possible views to be extracted from a single traversal of a scene description.
It supports a wide range of rendering primitives, including polygonal meshes, higher-order surface primitives (e.g. spheres, cylinders, and parametric patches), point-based models, and image-based representations. To demonstrate our concept, we have implemented a hardware prototype that includes a 4D, z-buffered frame-buffer supporting dynamic view selection at the time of raster scan-out. As a result, our implementation supports extremely low display-update latency. The PixelView architecture also supports rendering of the same scene for multiple eyes, which provides immediate benefits for stereo viewing methods like those used in today's virtual environments, particularly when there are multiple participants. In the future, view-independent graphics rendering hardware will also be essential to support the multitude of viewpoints required for real-time autostereoscopic and holographic display devices.

Item Silhouette Maps for Improved Texture Magnification (The Eurographics Association, 2004) Sen, Pradeep; Tomas Akenine-Moeller and Michael McCool
Texture mapping is a simple way of increasing visual realism without adding geometrical complexity. Because it is a discrete process, it is important to properly filter samples when the sampling rate of the texture differs from that of the final image. This is particularly problematic when the texture is magnified or minified. While reasonable approaches exist to tackle the minified case, few options exist for improving the quality of magnified textures in real-time applications. Most simply bilinearly interpolate between samples, yielding exceedingly blurry textures. In this paper, we address the real-time magnification problem by extending the silhouette map algorithm to general texturing. In particular, we discuss the creation of these silmap textures as well as a simple filtering scheme that allows for viewing at all levels of magnification.
The technique was implemented on current graphics hardware and our results show that we can achieve a level of visual quality comparable to that of a much larger texture.

Item A Quadrilateral Rendering Primitive (The Eurographics Association, 2004) Hormann, Kai; Tarini, Marco; Tomas Akenine-Moeller and Michael McCool
The only surface primitives that are supported by common graphics hardware are triangles, and more complex shapes have to be triangulated before being sent to the rasterizer. Even quadrilaterals, which are frequently used in many applications, are rendered as a pair of triangles after splitting them along either diagonal. This creates an undesirable C1-discontinuity that is visible in the shading or texture signal. We propose a new method that overcomes this drawback and is designed to be implemented in hardware as a new rasterizer. It processes a potentially non-planar quadrilateral directly without any splitting and interpolates attributes smoothly inside the quadrilateral. This interpolation is based on a recent generalization of barycentric coordinates that we adapted to handle perspective correction and situations in which a quadrilateral is partially behind the point of view.

Item Tile-Based Texture Mapping on Graphics Hardware (The Eurographics Association, 2004) Wei, Li-Yi; Tomas Akenine-Moeller and Michael McCool
Texture mapping has been a fundamental feature for commodity graphics hardware. However, a key challenge for texture mapping is how to store and manage large textures on graphics processors. In this paper, we present a tile-based texture mapping algorithm by which we only have to physically store a small set of texture tiles instead of a large texture. Our algorithm generates an arbitrarily large and non-periodic virtual texture map from the small set of stored texture tiles. Because we only have to store a small set of tiles, it minimizes the storage requirement to a small constant, regardless of the size of the virtual texture.
In addition, the tiles are generated and packed into a single texture map, so that the hardware filtering of this packed texture map corresponds directly to the filtering of the virtual texture. We implement our algorithm as a fragment program, and demonstrate performance on the latest graphics processors.

Item A Hierarchical Shadow Volume Algorithm (The Eurographics Association, 2004) Aila, Timo; Akenine-Möller, Tomas; Tomas Akenine-Moeller and Michael McCool
The shadow volume algorithm is a popular technique for real-time shadow generation using graphics hardware. Its major disadvantage is that it is inherently fillrate-limited, as the performance is inversely proportional to the area of the projected shadow volumes. We present a new algorithm that reduces the shadow volume rasterization work significantly. With our algorithm, the amount of per-pixel processing becomes proportional to the screen-space length of the visible shadow boundary instead of the projected area. The first stage of the algorithm finds 8×8 pixel tiles, whose 3D bounding boxes are either completely inside or outside the shadow volume. After that, the second stage performs per-pixel computations only for the potential shadow boundary tiles. We outline a two-pass implementation, and also describe an efficient single-pass hardware architecture, in which the two stages are separated using a delay stream. The only modification required in applications is a new pair of calls for marking the beginning and end of a shadow volume.
In our test scenes, the algorithm processes up to 11.5 times fewer pixels compared to current state-of-the-art methods, while reducing the external video memory bandwidth by a factor of up to 17.1.

Item Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware (The Eurographics Association, 2004) Foley, Tim; Houston, Mike; Hanrahan, Pat; Tomas Akenine-Moeller and Michael McCool
Partitioning fragment shaders into multiple rendering passes is an effective technique for virtualizing shading resource limits in graphics hardware. The Recursive Dominator Split (RDS) algorithm is a polynomial-time algorithm for partitioning fragment shaders for real-time rendering that has been shown to generate efficient partitions. RDS does not, however, work for shaders with multiple outputs, and does not optimize for hardware with support for multiple render targets. We present Merging Recursive Dominator Split (MRDS), an extension of the RDS algorithm to shaders with arbitrary numbers of outputs which can efficiently utilize hardware support for multiple render targets, as well as a new cost metric for evaluating the quality of multipass partitions on modern consumer graphics hardware. We demonstrate that partitions generated by our algorithm execute more efficiently than those generated by RDS alone, and that our cost model is effective in predicting the relative performance of multipass partitions.