Visual Analytics and Imaging Laboratory (VAI Lab)
Computer Science Department, Stony Brook University, NY

A Visual Analytics Framework for the Detection of Anomalous Call Stack Trees in High Performance Computing Applications

Abstract: Detecting anomalous runtime behavior is one of the most important tasks for performance diagnosis in High Performance Computing (HPC). Most existing methods find anomalous executions based on the properties of individual functions, such as execution time. However, these properties are insufficient to identify abnormal behavior without taking into account the context of the executions, such as the invocations of child functions and the communications with other HPC nodes. We improve upon the existing anomaly detection approaches by utilizing the call stack structures of the executions, which record rich temporal and contextual information. With our call stack tree (CSTree) representation of the executions, we formulate the anomaly detection problem as finding anomalous tree structures in a call stack forest. The CSTrees are converted to vector representations using our proposed stack2vec embedding. Structural and temporal visualizations of CSTrees are provided to support users in the identification and verification of the anomalies during an active anomaly detection process. Three case studies of real-world HPC applications demonstrate the capabilities of our approach.
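Illustration: The formulation in the abstract (executions as call stack trees, trees as vectors, anomalies as unusual vectors) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the method from the paper: the event format, the names `Node`, `build_trees`, `edge_counts`, and `anomaly_scores` are hypothetical, the caller-callee edge counts are a simple stand-in for the stack2vec embedding, and the distance-to-centroid score is a stand-in for the actual anomaly detection model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One function invocation in a call stack tree (hypothetical structure)."""
    name: str
    start: float
    end: float = 0.0
    children: list = field(default_factory=list)


def build_trees(events):
    """Build a forest from time-ordered (timestamp, 'enter' | 'exit', name) events.

    Assumes a single thread and well-nested enter/exit pairs.
    """
    roots, stack = [], []
    for timestamp, kind, name in events:
        if kind == "enter":
            node = Node(name, timestamp)
            if stack:
                stack[-1].children.append(node)  # nested call
            else:
                roots.append(node)  # top-level call starts a new tree
            stack.append(node)
        else:
            stack.pop().end = timestamp
    return roots


def edge_counts(tree):
    """Count caller-callee pairs in one tree (a stand-in for stack2vec)."""
    counts = Counter()
    todo = [tree]
    while todo:
        node = todo.pop()
        for child in node.children:
            counts[(node.name, child.name)] += 1
            todo.append(child)
    return counts


def anomaly_scores(trees):
    """Score each tree by its Euclidean distance to the mean edge-count vector."""
    bags = [edge_counts(tree) for tree in trees]
    keys = sorted(set().union(*bags))
    vectors = [[bag[key] for key in keys] for bag in bags]
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    return [math.dist(vector, centroid) for vector in vectors]
```

Under these assumptions, a tree whose call structure differs from the majority (for example, one that invokes an unexpected child function) receives a higher score than the trees that share the common structure. Note that this stand-in ignores timing and inter-node communication, both of which the abstract names as part of the context.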

Teaser: The interface of our system for anomalous call stack tree (CSTree) detection:

Panel (a): The scatter plot shows the projection of our stack2vec embeddings of the CSTrees. Each point in the projection represents a CSTree.
Panel (b): Summary structures of the top candidate anomalies from panel (a).
Panel (c): The user can investigate the detailed structure and the anomalous subtrees of a CSTree of interest.
Panel (d): The level-of-detail timeline visualization of the selected CSTree shows the temporal pattern of the invocations and the communications between the HPC nodes.
Panel (e): The user is able to label the CSTrees of interest after exploration to update the anomaly detection model.

Video: Watch it to get a quick overview:

Paper: C. Xie, W. Xu, K. Mueller, "A Visual Analytics Framework for the Detection of Anomalous Call Stack Trees in High Performance Computing Applications," IEEE Trans. on Visualization and Computer Graphics, 25(1): 215-224, 2019. (won best paper runner-up at VIS 2018)

Funding: NSF grant IIS-1527200, BNL LDRD grant 16-041, ECP CODAR project 17-SC-20-SC.