On Using Classification Datasets to Evaluate Graph Outlier Detection: Peculiar Observations and New Insights

Big Data. 2023 Jun;11(3):151-180. doi: 10.1089/big.2021.0069. Epub 2021 Dec 3.

Abstract

It is common practice in the outlier mining community to repurpose classification datasets for evaluating various detection models. To that end, a binary classification dataset is often used, where samples from one of the classes are designated as the "inlier" samples, and the other class is substantially down-sampled to create the (ground-truth) "outlier" samples. Graph-level outlier detection (GLOD) is rarely studied but has many potentially influential real-world applications. In this study, we identify an intriguing issue with repurposing graph classification datasets for GLOD. We find that the ROC-AUC performance of the models changes significantly ("flips" from high to very low, even worse than random) depending on which class is down-sampled. Interestingly, the ROC-AUCs on these two variants approximately sum to 1, and for a certain family of propagation-based outlier detection models the performance gap widens with the number of propagation iterations. We carefully study the graph embedding space produced by propagation-based models and find two driving factors: (1) disparity between within-class densities, which is amplified by propagation, and (2) overlapping support (mixing of embeddings) across classes. We also study other graph embedding methods and downstream outlier detectors, and find that the intriguing "performance flip" issue persists widely, although which down-sampled version achieves higher performance may vary. Careful analysis of these comprehensive results further deepens our understanding of the issue. With this study, we aim to draw attention to this previously unnoticed (to our knowledge) issue in the rarely studied GLOD problem, and specifically to the following questions: (1) Given the performance flip we identified, where one down-sampled version often yields worse-than-random performance, is it appropriate to evaluate GLOD models by their average performance across all down-sampled versions when repurposing graph classification datasets? (2) Considering that the performance flip is consistently observed across the different graph embedding methods we studied, is it possible to design better graph embedding methods that overcome the issue? We conclude the article with our insights into these questions.
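
To make the repurposing protocol and the "sum to 1" observation concrete, the following minimal Python sketch (not the authors' code; the 1-D stand-ins for graph embeddings, the distance-to-mean detector, and the 5% down-sampling rate are all illustrative assumptions) builds both down-sampled variants from a toy binary dataset with disparate within-class densities and computes ROC-AUC on each:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy 1-D stand-ins for graph embeddings of a binary classification
# dataset: class A is dense, class B is sparse (disparate densities).
class_a = rng.normal(loc=0.0, scale=1.0, size=500)
class_b = rng.normal(loc=1.0, scale=3.0, size=500)

def down_sample(inliers, outlier_pool, rate=0.05, seed=1):
    # The repurposing protocol: keep all "inlier" samples and a small
    # fraction of the other class as ground-truth "outliers".
    out = np.random.default_rng(seed).choice(
        outlier_pool, size=int(rate * len(outlier_pool)), replace=False)
    x = np.concatenate([inliers, out])
    y = np.concatenate([np.zeros(len(inliers)), np.ones(len(out))])
    return x, y

def outlier_score(x):
    # Toy detector: distance from the pooled mean, so samples in
    # denser regions score lower and samples in sparser regions higher.
    return np.abs(x - x.mean())

x1, y1 = down_sample(class_a, class_b)   # variant 1: class B down-sampled
x2, y2 = down_sample(class_b, class_a)   # variant 2: class A down-sampled
print(f"variant 1 AUC = {roc_auc_score(y1, outlier_score(x1)):.3f}")
print(f"variant 2 AUC = {roc_auc_score(y2, outlier_score(x2)):.3f}")

# With a fixed set of scores, swapping which class is labeled "outlier"
# mirrors the AUC around 0.5 exactly: AUC(y, s) + AUC(1 - y, s) = 1.
# Across the two down-sampled variants the relation is only approximate,
# since each variant contains a different subset of the data.
s1 = outlier_score(x1)
assert np.isclose(roc_auc_score(y1, s1) + roc_auc_score(1 - y1, s1), 1.0)

Running this sketch, one variant's AUC lands well above 0.5 and the other's well below, reproducing in miniature the performance flip the abstract describes: the same density-based scoring that rewards the dense class as inliers penalizes it as outliers.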

Keywords: classification datasets; graph propagation; outlier evaluation.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Area Under Curve*