A Framework for Hierarchical Density-Based Clustering Exploration

  • Author / Creator
    Cavalcante Araujo Neto, Antonio
  • HDBSCAN* is a hierarchical density-based clustering method that requires a single parameter, mpts, a smoothing factor that implicitly influences which clusters are more detectable in the resulting clustering hierarchy. While a small change in mpts typically leads to a small change in the clustering structure, choosing a “good” mpts value can be challenging: depending on the data distribution, a high or low mpts value may be more appropriate, and certain clusters may reveal themselves at different values. This thesis studies the problems related to the effects of mpts on the clustering hierarchies produced by HDBSCAN*. We present an analysis of HDBSCAN*’s density estimator and discuss how it could be improved to mitigate the issues with the choice of an mpts value. We also discuss how this modification affects the results obtained by HDBSCAN* and why one might still need to explore multiple parameter settings to better understand the cluster structures in the data. Hence, we are interested in the efficient computation and exploration of hierarchies constructed under different parameter settings, and in strategies that can ease the task of finding the appropriate amount of smoothing for density estimation in different regions of the data. More specifically, we propose a strategy that can efficiently compute over 100 clustering hierarchies for the computational cost of running HDBSCAN* twice. Our strategy is based on replacing the complete graph in HDBSCAN* with a much smaller graph, the relative neighborhood graph (RNG), that provably contains all the information needed to compute a set of clustering hierarchies for a range of mpts values. To help with the analysis of the clustering hierarchies computed with our RNG-based strategy, we propose MustaCHE, a visualization tool that helps users explore a set of clustering hierarchies and focus their analyses on values of mpts that produce “significantly” different results.
Moreover, we observed that, for some datasets, a single value of mpts is not enough to reveal all the cluster structures in the data simultaneously. Therefore, we discuss how HDBSCAN* can be used to compute hierarchies that contain cluster structures found with different values of mpts, and how users can select which values of mpts are to be used in different parts of the data to construct clustering hierarchies in this fashion. While these contributions were made in the context of unsupervised clustering with HDBSCAN*, their relevance goes beyond their original purpose. Thus, we discuss how each of the contributions presented in this thesis can be extended to a class of semi-supervised clustering and semi-supervised classification algorithms.
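
As an illustration of why mpts acts as a smoothing factor, the following minimal sketch computes the core distances and the mutual reachability distances that HDBSCAN* is generally described as using to weight its complete graph. It is not taken from the thesis: the function names are hypothetical, Euclidean distance is assumed, and the brute-force pairwise computation shown here is exactly the cost that the RNG-based strategy described above is meant to avoid.

```python
import numpy as np

def core_distances(X, mpts):
    """Return pairwise distances and each point's core distance.

    Assumption: the core distance is the distance to the mpts-th nearest
    neighbour, counting the point itself as its own first neighbour.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # Euclidean distances (assumed metric)
    core = np.sort(D, axis=1)[:, mpts - 1]     # mpts-th smallest entry in each row
    return D, core

def mutual_reachability(X, mpts):
    """Edge weights of the complete graph for a given mpts (brute force)."""
    D, core = core_distances(X, mpts)
    M = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(M, 0.0)                   # no self-edges
    return M

# Hypothetical one-dimensional toy data: three nearby points and one far point.
X = np.array([[0.0], [1.0], [3.0], [10.0]])
M2 = mutual_reachability(X, mpts=2)
M3 = mutual_reachability(X, mpts=3)
```

With mpts = 1 the weights reduce to the plain distances; larger values raise the weights of edges that touch sparse regions, which is one way to see how different mpts values can emphasize different clusters.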

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.