Download the full-sized PDF of Integration and Evaluation of Different Kernel Density Estimates in Hierarchical Density-Based ClusteringDownload the full-sized PDF



Permanent link (DOI):


Export to: EndNote  |  Zotero  |  Mendeley


This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Integration and Evaluation of Different Kernel Density Estimates in Hierarchical Density-Based Clustering Open Access


Other title
kernel density estimation
density-based clustering
data generator with hierarchical ground truth
hierarchical cluster analysis
Type of item
Degree grantor
University of Alberta
Author or creator
Khare, Kriti
Supervisor and department
Sander, Joerg (Department of Computing Science)
Examining committee member and department
Nascimento, Mario (Department of Computing Science, University of Alberta)
Sander, Joerg (Department of Computing Science, University of Alberta)
Campello, Ricardo (University of Sao Paulo at Sao Carlos)
Department of Computing Science

Date accepted
Graduation date
2016-06:Fall 2016
Master of Science
Degree level
Most machine learning methods make assumptions about data. Parametric statistics assume that the data is sampled from a distribution with fixed properties set by the algorithm or user. In contrast, non-parametric statistics do not assume the properties of a distribution. Instead, they assume that the data comes from a distribution and then they estimate the properties of this distribution for each object. Cluster analysis is a machine learning method for finding meaningful groups in the data when there is little or no prior information available about the data being analyzed. Clustering techniques can be broadly divided into partitional algorithms, that produce unnested clusters, and hierarchical clustering, in which there are nested clusters. In the last decade, there has been an increased interest in developing non-parametric clustering algorithms with increase in computation power of machines. However, so far, the literature on non-parametric clustering has focused on partitional techniques. This thesis proposes a generalization of the hierarchical density based clustering algorithm, HDBSCAN*, that implements arbitrary non-parametric density estimates. For that, we first analyze the challenges faced in using kernel density estimates for a hierarchical clustering algorithm, and present different ways of defining connectivity in density space. Kernel density estimates have a parameter known as the bandwidth, which determines the degree of influence of distant objects on the density estimate of the object of interest. We do an exhaustive search for the best bandwidth, and compare the accuracy of the cluster assignment obtained from it with the accuracy obtained by using heuristic bandwidth estimators found in literature. The evaluation of hierarchical clustering results is typically done by extracting a flat partition from a hierarchy and comparing this extracted partition with a "ground-truth" partition of the same dataset, which may come from class labels given independently to a dataset. In order to truly evaluate the quality of a constructed hierarchy using different density estimates, however, one would need hierarchical ground truth. Hierarchical structures that could be used as ground truth are only available in rare cases, and therefore, we also develop in this thesis, a data generator that produces the hierarchical ground truth along with the dataset to provide additional data sets for our experimental evaluation. We do an extensive study comparing the accuracy of the cluster assignments when using different strategies for setting the bandwidth for certain Kernels from literature, and comparing it with previous proposals for hierarchical density based clustering based on unnormalized knn-Kernels. Based on the experimental results, using both flat and hierarchical ground truth labels, we conclude that there are connectivity methods that give good results, however, they may be computationally expensive for some Kernels. Also, the heuristic bandwidth estimators do not perform better, in general, than previous proposals based on knn-Kernels. Therefore, new heuristics must be developed for finding the bandwidth to use for clustering purposes with the proposed connectivity methods.
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 1977705
Last modified: 2016:11:16 14:50:20-07:00
Filename: Khare_Kriti_201609_MSc.pdf
Original checksum: a9e458b0e28347ed2be248c59f764ef4
Well formed: false
Valid: false
Status message: Invalid page tree node offset=956482
Status message: Unexpected error in findFonts java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary offset=3675
Page count: 72
Activity of users you follow
User Activity Date