Integration and Evaluation of Different Kernel Density Estimates in Hierarchical Density-Based Clustering

Khare, Kriti

doi:doi:10.7939/R3B853V02

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

323 views
605 downloads

Integration and Evaluation of Different Kernel Density Estimates in Hierarchical Density-Based Clustering

Author / Creator

Khare, Kriti
Most machine learning methods make assumptions about data. Parametric statistics assume that the data is sampled from a distribution with fixed properties set by the algorithm or user. In contrast, non-parametric statistics do not assume the properties of a distribution. Instead, they assume that the data comes from a distribution and then they estimate the properties of this distribution for each object. Cluster analysis is a machine learning method for finding meaningful groups in the data when there is little or no prior information available about the data being analyzed. Clustering techniques can be broadly divided into partitional algorithms, that produce unnested clusters, and hierarchical clustering, in which there are nested clusters. In the last decade, there has been an increased interest in developing non-parametric clustering algorithms with increase in computation power of machines. However, so far, the literature on non-parametric clustering has focused on partitional techniques. This thesis proposes a generalization of the hierarchical density based clustering algorithm, HDBSCAN*, that implements arbitrary non-parametric density estimates. For that, we first analyze the challenges faced in using kernel density estimates for a hierarchical clustering algorithm, and present different ways of defining connectivity in density space. Kernel density estimates have a parameter known as the bandwidth, which determines the degree of influence of distant objects on the density estimate of the object of interest. We do an exhaustive search for the best bandwidth, and compare the accuracy of the cluster assignment obtained from it with the accuracy obtained by using heuristic bandwidth estimators found in literature. The evaluation of hierarchical clustering results is typically done by extracting a flat partition from a hierarchy and comparing this extracted partition with a "ground-truth" partition of the same dataset, which may come from class labels given independently to a dataset. In order to truly evaluate the quality of a constructed hierarchy using different density estimates, however, one would need hierarchical ground truth. Hierarchical structures that could be used as ground truth are only available in rare cases, and therefore, we also develop in this thesis, a data generator that produces the hierarchical ground truth along with the dataset to provide additional data sets for our experimental evaluation. We do an extensive study comparing the accuracy of the cluster assignments when using different strategies for setting the bandwidth for certain Kernels from literature, and comparing it with previous proposals for hierarchical density based clustering based on unnormalized knn-Kernels. Based on the experimental results, using both flat and hierarchical ground truth labels, we conclude that there are connectivity methods that give good results, however, they may be computationally expensive for some Kernels. Also, the heuristic bandwidth estimators do not perform better, in general, than previous proposals based on knn-Kernels. Therefore, new heuristics must be developed for finding the bandwidth to use for clustering purposes with the proposed connectivity methods.
Subjects / Keywords
Graduation date

Fall 2016
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/R3B853V02
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sander, Joerg (Department of Computing Science)
Examining committee members and their departments
- Nascimento, Mario (Department of Computing Science, University of Alberta)
- Sander, Joerg (Department of Computing Science, University of Alberta)
- Campello, Ricardo (University of Sao Paulo at Sao Carlos)