Beyond Defaults: Parameter Free Outlier Detection using GLOSH

Ghosh, Kushankur

doi:doi:10.7939/r3-zs51-gq83

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

234 views
259 downloads

Beyond Defaults: Parameter Free Outlier Detection using GLOSH

Author / Creator

Ghosh, Kushankur
In machine learning and data mining, outliers—data points significantly differing from the majority—often pose challenges by introducing irrelevant information. Unsupervised methods are often used for detecting them as the information about outliers is unknown. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN, a hierarchical clustering approach. GLOSH estimates outlier scores (GLOSH scores) for each data point by comparing its density to the highest density point in its closest cluster within the HDBSCAN hierarchy. However, GLOSH is sensitive to the minpts parameter, that estimates the density in the hierarchy. Different minpts values may result in different hierarchies, with some representing the underlying cluster structure better than others. Given the lack of prior knowledge about the data in practice, it is unlikely to know an appropriate minpts value beforehand, i.e., that assigns higher GLOSH scores to “true outliers” than to “true inliers”. Moreover, to select outliers using GLOSH, one has to pre-define a value n that is used to determine the n datapoints with the highest GLOSH scores. These n datapoints are treated as “potential outliers”. However, in practice, one may not know how many outliers are present in a dataset, making it unlikely to know a suitable value for n.

The first contribution of this thesis is an automated approach that aims at determining the value of minpts from a large range of minpts values that results in the best GLOSH performance. Given a range of minpts values, we can obtain a corresponding range of GLOSH scores for each datapoint, which we call the datapoint’s GLOSH–Profile. We study the behavior of GLOSH–Profiles for distinct outlier types, establishing their ability to distinguish between different kinds of outliers.

Our first major observation is that the minpts value that results in the best overall GLOSH performance corresponds to minpts value where the GLOSH scores in the GLOSH–Profiles start changing at a similar rate. Based on this observation, we develop an automated, unsupervised method to find the minpts value at which the GLOSH scores in the profiles start changing at a similar rate, thus potentially yielding the best or nearly the best results for GLOSH. We apply our method on a range of different synthetic and real datasets with added synthetic outliers, and show that our approach of selecting the minpts value is able to match the best possible performance of GLOSH in a given range of minpts values.

Our second major observation is regarding the minpts value that yields the best GLOSH performance: the GLOSH scores of outlier datapoints notably deviate from the inlier GLOSH scores (for that minpts value) when arranged in increasing order. This observation serves as the key for the second contribution of this thesis—a strategy to estimate a threshold for classifying points into inliers and (potential) outliers, without relying on a pre defined value of n. The proposed approach is evaluated across synthetic, semi-synthetic, and real datasets. The results show that our approach yields a consistently effective threshold.
Subjects / Keywords
Graduation date

Spring 2024
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-zs51-gq83
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sander, Jörg (Computing Science)
- Naldi, Murilo (Computer Science)