High-dimensional data mining: subspace clustering, outlier detection and applications to classification

Foss, Andrew

doi:10.7939/R3H98ZQ6J

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Foss, Andrew
Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering. In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework, MAXCLUS, for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation, GCLUS, that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either 'hard' or 'fuzzy' clustering results as desired. In order to handle extremely high-dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier, GSEP, which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research into therapeutic applications. In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (TENT, TROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets, regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality, and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers, and on other data the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining: classification through differences in variance rather than spatial location. As an appendix, we provide an algorithm for estimating false negative and false positive rates so that these can be compensated for.
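The claim that heavily overlapping classes become separable through differences in variance alone can be illustrated with a toy simulation. The sketch below is not TENT, TROF, FASTOUT, or any other algorithm from the thesis. It assumes two isotropic Gaussian classes of equal size that share the same mean and differ only in standard deviation, and it uses distance from the origin as a stand-in outlierness score. The function name `norm_separation_accuracy` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def norm_separation_accuracy(dim, n_per_class=200, sigma_a=1.0, sigma_b=1.3, seed=0):
    """Score how well distance from the origin separates two zero-mean classes.

    Both classes are isotropic Gaussians centred at the origin, so they overlap
    completely in location and differ only in variance (an assumption of this toy).
    """
    rng = random.Random(seed)

    def sample_norm(sigma):
        # Euclidean norm of one dim-dimensional Gaussian sample.
        return math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim)))

    norms_a = [sample_norm(sigma_a) for _ in range(n_per_class)]
    norms_b = [sample_norm(sigma_b) for _ in range(n_per_class)]

    # Unsupervised split: threshold at the median of the pooled scores.
    # Class labels are used only afterwards, to measure the result.
    threshold = statistics.median(norms_a + norms_b)
    correct = sum(r < threshold for r in norms_a) + sum(r >= threshold for r in norms_b)
    return correct / (2 * n_per_class)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, norm_separation_accuracy(dim))
```

In this toy setting the norms of each class concentrate around roughly sigma times the square root of the dimensionality, so the gap between the two classes grows with dimensionality while the spread of each class stays about the same. This is why the median split tends to improve as `dim` increases.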
Subjects / Keywords
Graduation date

Spring 2010
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3H98ZQ6J
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Osmar Zaiane, Computing Science
Examining committee members and their departments
- Raymond Ng, Computer Science, University of British Columbia
- Mauricio Sacchi, Physics
- Joerg Sander, Computing Science
- Dale Schuurmans, Computing Science