Parallelization of Hierarchical Density-Based Clustering using MapReduce

  • Author / Creator
    Syed, Talat Iqbal
  • Cluster analysis plays a very important role in understanding various phenomena in data without any prior knowledge. However, hierarchical clustering algorithms, which are widely used for their representation of data, are computationally expensive. Large datasets are now prevalent in many scientific domains, but the data dependencies inherent in hierarchical clustering methods make them difficult to parallelize. We introduce two parallel algorithms for a density-based hierarchical clustering algorithm, HDBSCAN*. The first method, called the Random Blocks Approach, is based on the parallelization of the Single Linkage algorithm and computes an exact HDBSCAN* hierarchy in parallel, while the second method, the Recursive Sampling Approach, computes an approximate version of HDBSCAN* in parallel. To improve the accuracy of the Recursive Sampling Approach, we combine it with a data summarization technique called Data Bubbles. We also provide a method to extract clusters at distributed nodes and form an approximate cluster tree without traversing the complete hierarchy. The algorithms are implemented using the MapReduce framework, and the results are evaluated in terms of both accuracy and speed on various datasets.

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Supervisor / co-supervisor and their department(s)
    • Sander, Joerg (Computing Science)
  • Examining committee members and their departments
    • Sander, Joerg (Computing Science)
    • Zaiane, Osmar (Computing Science)
    • Stroulia, Eleni (Computing Science)