Title: La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers
Authors: Guo, Hanqi; Di, Sheng; Gupta, Rinku; Peterka, Tom; Cappello, Franck
Editors: Hank Childs and Fernando Cucchietti
Date: 2018-06-02
Year: 2018
ISBN: 978-3-03868-054-3
ISSN: 1727-348X
DOI: https://doi.org/10.2312/pgv.20181099
URI: https://diglib.eg.org:443/handle/10.2312/pgv20181099
Pages: 91-100

Abstract:
We design and implement La VALSE, a scalable visualization tool for exploring tens of millions of reliability, availability, and serviceability (RAS) log records from IBM Blue Gene/Q systems. Our tool is designed to meet various analysis requirements, including tracing the causes of failure events and investigating correlations among redundant and noisy RAS messages. La VALSE consists of multiple linked views for visualizing RAS logs; each log message has a time stamp, physical location, network address, and multiple categorical dimensions such as severity and category. The timeline view features scalable ThemeRiver and arc diagrams that enable interactive exploration of tens of millions of log messages. The spatial view visualizes the occurrences of RAS messages on hundreds of thousands of elements of Mira (compute cards, node boards, midplanes, and racks) with view-dependent level-of-detail rendering. The multidimensional view enables interactive filtering of different categorical dimensions of RAS messages. To achieve interactivity, we develop an efficient and scalable online data cube engine that can query 55 million RAS logs in less than one second. We present several case studies on Mira, a top supercomputer at Argonne National Laboratory. The case studies demonstrate that La VALSE can help users quickly identify the sources of failure events and analyze spatiotemporal correlations of RAS messages at different scales.