Permanent link (DOI): https://doi.org/10.7939/R3D679
This file is in the following communities:
|Graduate Studies and Research, Faculty of|
This file is in the following collections:
|Theses and Dissertations|
Finding, Evaluating and Exploring Clustering Alternatives Unsupervised and Semi-supervised
Open Access
- Other title
Hierarchical Density-Based Clustering
Density-Based Clustering Validation
Semi-supervised Model Selection
- Type of item
- Degree grantor
University of Alberta
- Author or creator
- Supervisor and department
Sander, Jörg (Computing Science)
- Examining committee member and department
Zaïane, Osmar R. (Computing Science)
Greiner, Russell (Computing Science)
Campello, Ricardo J.G.B. (Computer Sciences)
Spiliopoulou, Myra (Computer Science)
Department of Computing Science
- Date accepted
- Graduation date
Doctor of Philosophy
- Degree level
Clustering aims at grouping data objects into meaningful clusters using no (or only a small amount of) supervision. This thesis studies two major clustering paradigms: density-based and semi-supervised clustering. Density-based clustering algorithms seek partitions in which high-density areas of points (clusters that are not necessarily globular) are separated by low-density areas that may contain noise objects. Semi-supervised clustering algorithms use a small amount of information about the data to guide the clustering task.
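To make the density-based paradigm concrete, the following is a minimal sketch of classic DBSCAN, a well-known density-based algorithm. It is not one of the algorithms proposed in this thesis; the function name `dbscan` and the parameters `eps` and `min_pts` are illustrative assumptions. Points with at least `min_pts` neighbours within radius `eps` form the dense cores of clusters, and points that no core point reaches are labelled as noise.

```python
from math import dist

NOISE = -1  # label used for points in low-density regions


def dbscan(points, eps, min_pts):
    """Classic DBSCAN sketch; returns one label per point, NOISE for noise."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        # Indices of all points within distance eps of point i (including i).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Two dense groups and one isolated point (made-up toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this toy data the two groups of four points become separate clusters and the isolated point is labelled as noise. The neighbour search here is quadratic in the number of points, which is acceptable only for small illustrations.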
In the context of density-based clustering, we study (a) the validation of density-based clustering and (b) hierarchical density-based clustering.
The validation of density-based clustering, i.e., the objective and quantitative assessment of clustering results, is one of the most challenging aspects of clustering. Numerous relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. We propose a relative density-based validation index, DBCV, that assesses the quality of a clustering with arbitrarily shaped clusters based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results.
In addition to DBCV, we make several major contributions in the area of hierarchical density-based clustering. We improve on the AUTO-HDS framework for automated clustering and visualization of biological data sets by removing a parameter, thereby making the cluster extraction stage simpler and more accurate. We also propose a general approach to hierarchical density-based clustering, called GHDBSCAN, which generalizes density-based clustering by identifying its essential components. Based on this generalization, we propose two algorithms, GHDBSCAN(NMRD) and GHDBSCAN(NMRD+PF), which improve over previous state-of-the-art methods both theoretically and practically.
Regarding semi-supervised clustering, we use the knowledge available about a dataset in the form of constraints to guide the clustering algorithm. In this context, we provide two approaches for model selection that allow the user to select the best model based on a few constraints and/or the DBCV value. We also discuss a framework for extracting a partitional clustering from a hierarchical clustering tree.
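As a rough illustration of constraint-based model selection, the sketch below scores each candidate clustering by the fraction of must-link and cannot-link constraints it satisfies and picks the highest-scoring one. This is a simplified stand-in, not the procedure developed in the thesis, whose details are not given in this abstract; the function names, the candidate labels, and the constraints are all hypothetical.

```python
def constraint_satisfaction(labels, must_link, cannot_link):
    """Fraction of pairwise constraints that a clustering satisfies."""
    # A must-link (i, j) is satisfied when i and j share a cluster label.
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    # A cannot-link (i, j) is satisfied when their labels differ.
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 0.0


def select_model(candidates, must_link, cannot_link):
    """Return the name of the candidate clustering with the highest score."""
    return max(
        candidates,
        key=lambda name: constraint_satisfaction(
            candidates[name], must_link, cannot_link
        ),
    )


# Hypothetical candidate clusterings of six objects (one label per object).
candidates = {
    "k=2": [0, 0, 0, 1, 1, 1],
    "k=3": [0, 0, 1, 1, 2, 2],
}
must_link = [(0, 2), (3, 5)]  # pairs the user says belong together
cannot_link = [(0, 3)]        # pairs the user says must be separated

best = select_model(candidates, must_link, cannot_link)
print(best)
```

Here the candidate `"k=2"` satisfies all three constraints while `"k=3"` satisfies only one, so `"k=2"` is selected. With very few constraints such scores can tie or overfit, which is one reason a constraint score might be combined with an unsupervised index such as DBCV.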
- Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
- Citation for previous publication
D. Moulavi, P.A. Jaskowiak, R.J.G.B. Campello, A. Zimek, J. Sander. Density-Based Clustering Validation. Proc. of the 2014 SIAM International Conference on Data Mining (SDM), Philadelphia, PA, USA, 2014.
M. Pourrajabi, D. Moulavi, R.J.G.B. Campello, A. Zimek, J. Sander, R. Goebel. Model Selection for Semi-Supervised Clustering. Proc. of the 17th Int. Conf. on Extending Database Technology (EDBT), Athens, Greece, 2014.
R.J.G.B. Campello, D. Moulavi, J. Sander. A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Vol. 9, No. 6, 1850-1852, 2012.
R.J.G.B. Campello, D. Moulavi, J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), LNAI 7819, Gold Coast, Australia, 2013, 160-172.
R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander. A Framework for Semi-Supervised and Unsupervised Optimal Extraction of Clusters from Hierarchies. Data Mining and Knowledge Discovery (DMKD), Vol. 27, 344-371, 2013.
- Date Uploaded
- Date Modified
File format: pdf (PDF/A)
Mime type: application/pdf
File size: 1899388
Last modified: 2015:10:12 11:55:30-06:00
Original checksum: 7bf777e7ab4331a315606d9d886aff18
Well formed: true
Page count: 167