




This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

High-dimensional data mining: subspace clustering, outlier detection and applications to classification Open Access


Other title
Subspace outlier detection
Error estimation
Subspace clustering
Outlier detection
Type of item
Degree grantor
University of Alberta
Author or creator
Foss, Andrew
Supervisor and department
Osmar Zaiane, Computing Science
Examining committee member and department
Mauricio Sacchi, Physics
Raymond Ng, Computer Science, University of British Columbia
Joerg Sander, Computing Science
Dale Schuurmans, Computing Science
Department of Computing Science

Date accepted
Graduation date
Doctor of Philosophy
Degree level
Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering. In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework, MAXCLUS, for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation, GCLUS, that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either 'hard' or 'fuzzy' clustering results as desired. In order to handle extremely high-dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier, GSEP, which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications. In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets, regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality, and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers, and on other data the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining: classification through differences in variance rather than spatial location. As an appendix, we provide an algorithm for estimating false negative and false positive rates so that these can be compensated for.
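The abstract's claim that heavily overlapping classes become separable in high dimensionality through variance alone can be illustrated with a small simulation. The sketch below is not T*ENT, T*ROF, or FASTOUT, and nothing in it is taken from the thesis: it assumes two zero-mean isotropic Gaussian classes that differ only in standard deviation, scores each point by its mean squared coordinate, and thresholds that score. The function names, the standard deviations, and the threshold (which uses the known variances purely for illustration) are all hypothetical choices.

```python
import random


def sample_class(n, dim, sigma, rng):
    """Draw n points from a zero-mean isotropic Gaussian with std sigma."""
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n)]


def mean_squared_coordinate(point):
    """Squared distance from the origin, averaged over the dimensions."""
    return sum(x * x for x in point) / len(point)


def separation_accuracy(dim, sigma_a=1.0, sigma_b=1.5, n=200, seed=0):
    """Fraction of points assigned to the right class by a variance threshold.

    Both classes share the same mean (the origin), so they overlap heavily
    in location; only their spread differs.
    """
    rng = random.Random(seed)
    class_a = sample_class(n, dim, sigma_a, rng)
    class_b = sample_class(n, dim, sigma_b, rng)

    # Illustrative threshold: midpoint of the two expected scores
    # (the expected mean squared coordinate of each class is sigma**2).
    threshold = (sigma_a ** 2 + sigma_b ** 2) / 2.0

    correct = sum(mean_squared_coordinate(p) < threshold for p in class_a)
    correct += sum(mean_squared_coordinate(p) >= threshold for p in class_b)
    return correct / (2 * n)


if __name__ == "__main__":
    for dim in (2, 20, 500):
        print(dim, separation_accuracy(dim))
```

Under these assumptions the score of each class concentrates around its own variance as the number of dimensions grows, so the same threshold that separates poorly in two dimensions separates almost perfectly in several hundred.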
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

File Details

File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 1557343
Last modified: 2016:08:04 02:59:07-06:00
Filename: afthesis27.1.2010.pdf
Original checksum: 0cadcf30fa9dfaef0a0ac3636c1f2cc2
Well formed: true
Valid: true
Page count: 148