A New Method For Semi-Supervised Density-Based Projected Clustering

  • Author / Creator
    Jullion, Zachary M
  • Density-based clustering methods extract high-density clusters that are separated by regions of lower density. HDBSCAN* is an existing algorithm for producing a density-based cluster hierarchy. To obtain clusters from this hierarchy, it includes an instance of FOSC (Framework for Optimal Selection of Clusters), which extracts significant clusters based on a measure known as cluster stability. We introduce CASAR (Compact And Separation Adjusted Ratio), a new algorithm for extracting significant clusters from an HDBSCAN* hierarchy. CASAR is similar to FOSC, but it defines local cluster quality differently and uses a different aggregation method for comparing the quality of descendant clusters to that of their ancestors in the hierarchy. The local cluster quality that CASAR uses is based on the validation index DBCV (Density-Based Cluster Validation). CASAR is designed to extract individual density-based clusters from subspaces and is not meant to be a general-purpose replacement for cluster stability. We also introduce a new semi-supervised density-based method for finding relevant subspaces. Given a set of should-link objects that belong to an undiscovered cluster, our method finds an appropriate set of attributes for extracting the cluster. Our method relies on well-established properties of density-based clusters, so it can be used as a pre-processing step for a wide variety of density-based clustering algorithms. We combine this method with HDBSCAN* and CASAR to produce a semi-supervised density-based projected clustering algorithm. In a series of experiments, we compare CASAR and cluster stability on both synthetic and real data sets. We also compare our semi-supervised density-based projected clustering algorithm to an existing semi-supervised projected clustering algorithm and to a well-known unsupervised projected clustering algorithm. We conclude this thesis with a summary of the strengths and weaknesses of our method, a summary of experimental findings, and a discussion of possible directions for future work.
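
To illustrate the kind of hierarchy-based selection the abstract refers to, the following minimal sketch shows a FOSC-style bottom-up pass in which a cluster is kept only if its own quality score is at least the summed score of the best selection among its descendants. The tree, the score values, and the names `Node` and `select_clusters` are hypothetical and chosen only for illustration. This is not CASAR, whose quality measure and aggregation method differ, and it does not reproduce the HDBSCAN* implementation (for instance, the root is not treated specially here).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cluster in a hypothetical cluster hierarchy."""
    name: str
    stability: float              # assumed per-cluster quality score
    children: list = field(default_factory=list)


def select_clusters(node):
    """Return (best_total_score, selected_cluster_names) for a subtree.

    A node is selected if its own score is at least the sum of the best
    scores achievable within its children's subtrees; otherwise the
    children's selections are propagated upward.
    """
    if not node.children:
        return node.stability, [node.name]

    child_total = 0.0
    child_selection = []
    for child in node.children:
        total, names = select_clusters(child)
        child_total += total
        child_selection.extend(names)

    if node.stability >= child_total:
        return node.stability, [node.name]
    return child_total, child_selection


# Hypothetical hierarchy: scores are made up for the example.
root = Node("root", 1.0, [
    Node("A", 4.0, [Node("A1", 1.5), Node("A2", 1.0)]),
    Node("B", 2.0, [Node("B1", 1.5), Node("B2", 1.5)]),
])

best_total, selected = select_clusters(root)
print(best_total, selected)
```

In this toy hierarchy, cluster A outscores its two children combined and is kept whole, while B is replaced by its children because their combined score is higher.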

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Supervisor / co-supervisor and their department(s)
    • Sander, Joerg (Computing Science)
  • Examining committee members and their departments
    • Campello, Ricardo (Computing Science)
    • Sander, Joerg (Computing Science)
    • Nascimento, Mario (Computing Science)