A Framework for Hierarchical Density-Based Clustering Exploration

Cavalcante Araujo Neto, Antonio

doi:10.7939/r3-kdyj-rg16



Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations


HDBSCAN* is a hierarchical density-based clustering method that requires a single parameter, mpts, a smoothing factor that implicitly influences which clusters are more detectable in the resulting clustering hierarchy. While a small change in mpts typically leads to a small change in the clustering structure, choosing a “good” mpts value can be challenging: depending on the data distribution, a high or low mpts value may be more appropriate, and certain clusters may reveal themselves at different values. This thesis studies the problems related to the effects of mpts on the clustering hierarchies produced by HDBSCAN*. We present an analysis of HDBSCAN*’s density estimator and discuss how it could be improved to mitigate the issues with the choice of an mpts value. We also discuss how this modification affects the results obtained by HDBSCAN* and why one might still need to explore multiple parameter settings to better understand the cluster structures in the data. Hence, we are interested in the efficient computation and exploration of hierarchies constructed under different parameter settings, and in strategies that can ease the task of finding the appropriate amount of smoothing for density estimation in different regions of the data. More specifically, we propose a strategy that is able to efficiently compute over 100 clustering hierarchies with the computational cost of running HDBSCAN* twice. Our strategy is based on the replacement of the complete graph in HDBSCAN* with a much smaller graph, the relative neighborhood graph (RNG), that provably contains all the information needed to compute a set of clustering hierarchies for a range of mpts values. To help with the analysis of the clustering hierarchies computed with our RNG-based strategy, we propose MustaCHE, a visualization tool that helps users explore a set of clustering hierarchies and focus their analyses on values of mpts that produce “significantly” different results.
Moreover, we observed that, for some datasets, a single value of mpts is not enough to reveal all the cluster structures in the data simultaneously. Therefore, we discuss how HDBSCAN* can be used to compute hierarchies that contain cluster structures found with different values of mpts, and how users can select which values of mpts are to be used in different parts of the data to construct clustering hierarchies in this fashion. While these contributions were made in the context of unsupervised clustering with HDBSCAN*, their relevance goes beyond their original purpose. Thus, we discuss how each of the contributions presented in this thesis can be extended to a class of semi-supervised clustering and semi-supervised classification algorithms.
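To make the role of mpts concrete, the following minimal sketch shows how the parameter enters the distances on which an HDBSCAN* hierarchy is built. It is not the thesis's RNG-based strategy: it is the naive baseline that recomputes everything on the complete graph for each mpts value, which is the cost the abstract's approach is meant to avoid. The function names and the toy data are hypothetical. The sketch assumes the usual HDBSCAN* convention that a point's core distance is the distance to its mpts-th nearest neighbor, counting the point itself, and that the mutual reachability distance between two points is the maximum of their two core distances and their ordinary distance.

```python
import math


def core_distances(points, mpts):
    """Distance from each point to its mpts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor,
    so mpts=1 gives a core distance of 0 for every point.
    """
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[mpts - 1])
    return result


def mutual_reachability(points, mpts):
    """Complete-graph weights: max(core(a), core(b), d(a, b))."""
    core = core_distances(points, mpts)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(core[i], core[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def mst_edge_weights(weights):
    """Prim's algorithm on a dense weight matrix; returns sorted MST edge weights.

    The sorted weights are the density levels at which groups of points
    merge in the resulting hierarchy.
    """
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n
    best[0] = 0.0
    picked = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        picked.append(best[u])
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
    return sorted(picked[1:])  # drop the zero-cost start vertex


# Hypothetical toy data: three close points and one distant point on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
for mpts in (1, 2, 3):
    levels = mst_edge_weights(mutual_reachability(points, mpts))
    print(mpts, core_distances(points, mpts), levels)
```

Each additional mpts value in the loop repeats the full quadratic computation, so exploring a range of values this way costs one complete run per value. Even on this toy input the merge levels shift as mpts grows, which illustrates why a single setting may not suit all regions of a dataset.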
Graduation date

Spring 2021
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-kdyj-rg16
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Jörg Sander (Computing Science)
- Ricardo Campello (Computing Science)