Finding non-Redundant, Statistically Significant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering

Moise, Gabriela; Sander, Joerg

doi:10.7939/R3J678Z1W



Communities and Collections

Computing Science, Department of / Technical Reports (Computing Science)



Technical report TR08-03. Projected and subspace clustering algorithms search for clusters of objects in subsets of attributes. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes. Subspace clustering enumerates clusters of objects in all subsets of attributes and produces many overlapping clusters. One problem with existing approaches is that their objectives are stated in a way that is not independent of the particular algorithm proposed to detect such clusters. A second problem is that cluster density is defined through user-defined parameters, which makes it hard to assess whether the reported clusters are an artifact of the algorithm or whether they actually stand out in the data in a statistical sense. We propose a novel problem formulation that aims at extracting axis-parallel regions that stand out in the data in a statistical sense. The set of axis-parallel, statistically significant regions that exist in a given data set is typically highly redundant. Therefore, we formulate the problem of representing this set through a reduced, non-redundant set of axis-parallel, statistically significant regions as an optimization problem. Exhaustive search is not a viable solution to the optimization problem because it is computationally infeasible. Consequently, we propose the approximation algorithm STATPC. Our comprehensive experimental evaluation shows that STATPC significantly outperforms existing projected and subspace clustering algorithms.
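
The abstract does not spell out how a region is tested for significance. As a rough illustration only, the sketch below assumes that the data are scaled to the unit hypercube, that the null model is a uniform distribution, and that an axis-parallel region "stands out" when the binomial upper-tail probability of its observed point count is small. The function names, the region encoding (attribute index mapped to an interval), and the planted-cluster data are hypothetical and are not taken from the report; this is not the STATPC algorithm.

```python
import math
import random


def binomial_upper_tail(n, k, p):
    """Return P[X >= k] for X ~ Binomial(n, p), summed exactly (fine for small n)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1)
    )


def region_p_value(points, region):
    """Score an axis-parallel region against a uniform null on [0, 1]^d.

    `region` maps an attribute index to a (low, high) interval; attributes
    that are absent are unconstrained, so the region lives in a subspace.
    Returns (number of points inside, binomial upper-tail probability).
    """
    # Under the assumed uniform null, the chance that one point falls in
    # the region equals the product of the interval widths.
    volume = 1.0
    for low, high in region.values():
        volume *= high - low

    inside = sum(
        all(low <= pt[attr] <= high for attr, (low, high) in region.items())
        for pt in points
    )
    return inside, binomial_upper_tail(len(points), inside, volume)


# Hypothetical data: 200 uniform background points in 3 attributes, plus
# 40 points concentrated in attributes 0 and 1 only (attribute 2 is noise).
rng = random.Random(0)
background = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
planted = [
    (rng.uniform(0.4, 0.6), rng.uniform(0.4, 0.6), rng.random())
    for _ in range(40)
]
data = background + planted

candidate = {0: (0.4, 0.6), 1: (0.4, 0.6)}
count, p_value = region_p_value(data, candidate)
print(f"points inside: {count}, upper-tail probability: {p_value:.3g}")
```

With 240 points and a region of relative volume 0.04, about 9.6 points would be expected under the assumed null, so a count of 40 or more yields a very small tail probability. A real method would also have to correct for the large number of candidate regions tested and remove redundant regions, which this sketch does not attempt.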
Date created

2008
Subjects / Keywords
- Projected clustering
- Subspace clustering
Type of Item

Report
DOI

https://doi.org/10.7939/R3J678Z1W
License

Attribution 3.0 International

Language
- English