ERA

Permanent link (DOI): https://doi.org/10.7939/R33F4KW5W

Communities

This file is in the following communities:

Graduate Studies and Research, Faculty of

Collections

This file is in the following collections:

Theses and Dissertations

Parallelization of Hierarchical Density-Based Clustering using MapReduce Open Access

Descriptions

Other title
Subject/Keyword
Hadoop
MapReduce
Density-based clustering
Data Bubbles
HDBSCAN
Parallelization
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Syed, Talat Iqbal
Supervisor and department
Sander, Joerg (Computing Science)
Examining committee member and department
Sander, Joerg (Computing Science)
Zaiane, Osmar (Computing Science)
Stroulia, Eleni (Computing Science)
Department
Department of Computing Science
Specialization

Date accepted
2015-02-20T11:09:03Z
Graduation date
2015-06
Degree
Master of Science
Degree level
Master's
Abstract
Cluster analysis plays a very important role in understanding various phenomena in data without any prior knowledge. However, hierarchical clustering algorithms, which are widely used for their representation of data, are computationally expensive. Large datasets have recently become prevalent in many scientific domains, but the data dependencies inherent in hierarchical clustering methods make them difficult to parallelize. We introduce two parallel algorithms for a density-based hierarchical clustering algorithm, HDBSCAN*. The first method, called the Random Blocks Approach, is based on the parallelization of the Single Linkage algorithm and computes an exact HDBSCAN* hierarchy in parallel, while the second method, the Recursive Sampling Approach, computes an approximate version of the HDBSCAN* hierarchy in parallel. To improve the accuracy of the Recursive Sampling Approach, we combine it with a data summarization technique called Data Bubbles. We also provide a method to extract clusters at distributed nodes and form an approximate cluster tree without traversing the complete hierarchy. The algorithms are implemented using the MapReduce framework, and the results are evaluated in terms of both accuracy and speed on various datasets.
Language
English
DOI
doi:10.7939/R33F4KW5W
Rights
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication

File Details

Date Uploaded
Date Modified
2015-06-15T07:07:33.569+00:00
Characterization
File format: pdf (PDF/A)
Mime type: application/pdf
File size: 4497726 bytes
Last modified: 2015-10-22T04:17:11-06:00
Filename: Syed_TalatIqbal_201502_MSc.pdf
Original checksum: 4e5b63d7cfd91e70710e73aa01f7de77
Well formed: true
Valid: true
Page count: 130