Usage
  • 31 views
  • 31 downloads

Beyond Defaults: Parameter Free Outlier Detection using GLOSH

  • Author / Creator
    Ghosh, Kushankur
  • In machine learning and data mining, outliers—data points significantly differing from the majority—often pose challenges by introducing irrelevant information. Unsupervised methods are often used for detecting them as the information about outliers is unknown. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN, a hierarchical clustering approach. GLOSH estimates outlier scores (GLOSH scores) for each data point by comparing its density to the highest density point in its closest cluster within the HDBSCAN hierarchy. However, GLOSH is sensitive to the minpts parameter, that estimates the density in the hierarchy. Different minpts values may result in different hierarchies, with some representing the underlying cluster structure better than others. Given the lack of prior knowledge about the data in practice, it is unlikely to know an appropriate minpts value beforehand, i.e., that assigns higher GLOSH scores to “true outliers” than to “true inliers”. Moreover, to select outliers using GLOSH, one has to pre-define a value n that is used to determine the n datapoints with the highest GLOSH scores. These n datapoints are treated as “potential outliers”. However, in practice, one may not know how many outliers are present in a dataset, making it unlikely to know a suitable value for n.

    The first contribution of this thesis is an automated approach that aims at determining the value of minpts from a large range of minpts values that results in the best GLOSH performance. Given a range of minpts values, we can obtain a corresponding range of GLOSH scores for each datapoint, which we call the datapoint’s GLOSH–Profile. We study the behavior of GLOSH–Profiles for distinct outlier types, establishing their ability to distinguish between different kinds of outliers.

    Our first major observation is that the minpts value that results in the best overall GLOSH performance corresponds to minpts value where the GLOSH scores in the GLOSH–Profiles start changing at a similar rate. Based on this observation, we develop an automated, unsupervised method to find the minpts value at which the GLOSH scores in the profiles start changing at a similar rate, thus potentially yielding the best or nearly the best results for GLOSH. We apply our method on a range of different synthetic and real datasets with added synthetic outliers, and show that our approach of selecting the minpts value is able to match the best possible performance of GLOSH in a given range of minpts values.

    Our second major observation is regarding the minpts value that yields the best GLOSH performance: the GLOSH scores of outlier datapoints notably deviate from the inlier GLOSH scores (for that minpts value) when arranged in increasing order. This observation serves as the key for the second contribution of this thesis—a strategy to estimate a threshold for classifying points into inliers and (potential) outliers, without relying on a pre defined value of n. The proposed approach is evaluated across synthetic, semi-synthetic, and real datasets. The results show that our approach yields a consistently effective threshold.

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-zs51-gq83
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.