Integration and Evaluation of Different Kernel Density Estimates in Hierarchical Density-Based Clustering

  • Author / Creator
    Khare, Kriti
  • Most machine learning methods make assumptions about data. Parametric statistics assume that the data is sampled from a distribution with fixed properties set by the algorithm or user. In contrast, non-parametric statistics do not assume the properties of a distribution. Instead, they assume that the data comes from a distribution and then they estimate the properties of this distribution for each object. Cluster analysis is a machine learning method for finding meaningful groups in the data when there is little or no prior information available about the data being analyzed. Clustering techniques can be broadly divided into partitional algorithms, that produce unnested clusters, and hierarchical clustering, in which there are nested clusters. In the last decade, there has been an increased interest in developing non-parametric clustering algorithms with increase in computation power of machines. However, so far, the literature on non-parametric clustering has focused on partitional techniques. This thesis proposes a generalization of the hierarchical density based clustering algorithm, HDBSCAN*, that implements arbitrary non-parametric density estimates. For that, we first analyze the challenges faced in using kernel density estimates for a hierarchical clustering algorithm, and present different ways of defining connectivity in density space. Kernel density estimates have a parameter known as the bandwidth, which determines the degree of influence of distant objects on the density estimate of the object of interest. We do an exhaustive search for the best bandwidth, and compare the accuracy of the cluster assignment obtained from it with the accuracy obtained by using heuristic bandwidth estimators found in literature. The evaluation of hierarchical clustering results is typically done by extracting a flat partition from a hierarchy and comparing this extracted partition with a "ground-truth" partition of the same dataset, which may come from class labels given independently to a dataset. In order to truly evaluate the quality of a constructed hierarchy using different density estimates, however, one would need hierarchical ground truth. Hierarchical structures that could be used as ground truth are only available in rare cases, and therefore, we also develop in this thesis, a data generator that produces the hierarchical ground truth along with the dataset to provide additional data sets for our experimental evaluation. We do an extensive study comparing the accuracy of the cluster assignments when using different strategies for setting the bandwidth for certain Kernels from literature, and comparing it with previous proposals for hierarchical density based clustering based on unnormalized knn-Kernels. Based on the experimental results, using both flat and hierarchical ground truth labels, we conclude that there are connectivity methods that give good results, however, they may be computationally expensive for some Kernels. Also, the heuristic bandwidth estimators do not perform better, in general, than previous proposals based on knn-Kernels. Therefore, new heuristics must be developed for finding the bandwidth to use for clustering purposes with the proposed connectivity methods.

  • Subjects / Keywords
  • Graduation date
    Fall 2016
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
  • Supervisor / co-supervisor and their department(s)
  • Examining committee members and their departments
    • Nascimento, Mario (Department of Computing Science, University of Alberta)
    • Sander, Joerg (Department of Computing Science, University of Alberta)
    • Campello, Ricardo (University of Sao Paulo at Sao Carlos)