EGGH04: SIGGRAPH/Eurographics Workshop on Graphics Hardware 2004
ISBN 3-905673-15-0
https://diglib.eg.org:443/handle/10.2312/331

Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
https://diglib.eg.org:443/handle/10.2312/EGGH.EGGH04.133-138
2004-01-01
Fatahalian, K.; Sugerman, J.; Hanrahan, P.
Tomas Akenine-Moeller and Michael McCool
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
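The input reuse described in this abstract can be illustrated with a minimal sketch of a blocked (tiled) matrix multiplication, the general idea behind cache-aware CPU approaches. This is not the paper's GPU implementation; the function name `matmul_blocked` and the `block` parameter are assumptions made for illustration only.

```python
def matmul_blocked(a, b, block=2):
    """Multiply two square matrices (lists of lists) tile by tile.

    Illustrative sketch only: `block` is a hypothetical tile size standing
    in for whatever fits in the closest cache.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Inside one tile, every element of a and b that is loaded
                # is reused up to `block` times before the tile is left.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Over the whole product each input element takes part in n multiply-adds, which is the O(n) reuse the abstract refers to; tiling only changes the order in which those uses occur so that they happen while the data is still cached.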

Hardware-based Simulation and Collision Detection for Large Particle Systems
https://diglib.eg.org:443/handle/10.2312/EGGH.EGGH04.123-132
2004-01-01
Kolb, A.; Latta, L.; Rezk-Salama, C.
Tomas Akenine-Moeller and Michael McCool
Particle systems have long been recognized as an essential building block for detail-rich and lively visual environments. Current implementations can handle up to 10,000 particles in real-time simulations and are mostly limited by the transfer of particle data from the main processor to the graphics hardware (GPU) for rendering. This paper introduces a full GPU implementation using fragment shaders of both the simulation and rendering of a dynamically-growing particle system. Such an implementation can render up to 1 million particles in real-time on recent hardware. The massively parallel simulation handles collision detection and reaction of particles with objects of arbitrary shape. The collision detection is based on depth maps that represent the outer shape of an object. The depth maps store distance values and normal vectors for collision reaction. Using a special texture-based indexing technique to represent normal vectors, standard 8-bit textures can be used to describe the complete depth map data. Alternatively, several depth maps can be stored in one floating point texture. In addition, a GPU-based parallel sorting algorithm is introduced that can be used to perform a depth sorting of the particles for correct alpha blending.
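A depth-map collision test of the kind this abstract describes might look roughly like the sketch below. It runs on the CPU and makes several assumptions that are not taken from the paper: a single depth map viewed along the +z axis, unit-sized grid cells, unit-length normals, and particles that stay within the map bounds. The names `reflect` and `collide` are hypothetical.

```python
def reflect(velocity, normal):
    """Mirror a velocity vector about a surface with the given unit normal."""
    d = sum(v * n for v, n in zip(velocity, normal))
    return tuple(v - 2.0 * d * n for v, n in zip(velocity, normal))


def collide(position, velocity, depth_map, normal_map):
    """Return the particle velocity after a depth-map collision test.

    Assumption: depth_map[row][col] stores the distance to the object
    surface along +z, and normal_map[row][col] stores the surface normal.
    """
    x, y, z = position
    col, row = int(x), int(y)
    if z < depth_map[row][col]:
        return velocity  # particle is still in front of the surface
    return reflect(velocity, normal_map[row][col])
```

For example, with `depth_map = [[1.0]]` and `normal_map = [[(0.0, 0.0, -1.0)]]`, a particle at depth 1.5 moving with velocity `(0.0, 0.0, 2.0)` has passed the surface and its velocity is mirrored, while a particle at depth 0.5 keeps its velocity.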

A Programmable Vertex Shader with Fixed-Point SIMD Datapath for Low Power Wireless Applications
https://diglib.eg.org:443/handle/10.2312/EGGH.EGGH04.107-114
2004-01-01
Sohn, Ju-Ho; Woo, Ramchan; Yoo, Hoi-Jun
Tomas Akenine-Moeller and Michael McCool
Real-time 3D graphics is becoming one of the attractive applications for 3G wireless terminals, although their battery lifetime and memory bandwidth limit the system resources for graphics processing. Instead of using a dedicated hardware engine with complex functions, we propose an efficient hardware architecture for a low-power vertex shader with programmability. Our architecture includes the following three features: I) a fixed-point SIMD datapath to exploit parallelism in vertex processing while keeping the power consumption low, II) a multithreaded coprocessor interface to decrease unwanted stalls between the main processor and the vertex shader, reducing power consumption by instruction-level power management, III) a programmable vertex engine to increase the datapath throughput by concurrent operations with the main processor. Simulation results show that the full 3D geometry pipeline can be performed at 7.2M vertices/sec with 115mW power consumption for polygons using the OpenGL lighting model. This is about a tenfold improvement over the latest graphics core with a floating-point datapath for wireless applications in terms of processing speed normalized by power consumption (Kvertices/sec per milliwatt).
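The normalized metric named in this abstract can be worked out from the two reported figures. The short sketch below only does that arithmetic; the function name is hypothetical, and no figure for the comparison core is assumed because the abstract does not give one.

```python
def kvertices_per_sec_per_mw(vertices_per_sec, power_mw):
    """Processing speed normalized by power, in Kvertices/sec per milliwatt."""
    return (vertices_per_sec / 1_000) / power_mw


# Figures reported in the abstract: 7.2M vertices/sec at 115 mW.
ratio = kvertices_per_sec_per_mw(7.2e6, 115)
print(round(ratio, 1))  # prints 62.6
```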

UberFlow: A GPU-Based Particle Engine
https://diglib.eg.org:443/handle/10.2312/EGGH.EGGH04.115-122
2004-01-01
Kipfer, Peter; Segal, Mark; Westermann, Rüdiger
Tomas Akenine-Moeller and Michael McCool
We present a system for real-time animation and rendering of large particle sets using GPU computation and memory objects in OpenGL. Memory objects can be used both as containers for geometry data stored on the graphics card and as render targets, providing an effective means for the manipulation and rendering of particle data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform particle manipulation are essential. Our system implements a versatile particle engine, including inter-particle collisions and visibility sorting. By combining memory objects with floating-point fragment programs, we have implemented a particle engine that entirely avoids the transfer of particle data at run-time. Our system can be seen as a forerunner of a new class of graphics algorithms, exploiting memory objects or similar concepts on upcoming graphics hardware to avoid bus bandwidth becoming the major performance bottleneck.