A Synthetic Data Generator for Clustering and Outlier Analysis

Pei, Yaling; Zaiane, Osmar

doi:10.7939/R3B23S


Communities and Collections

Computing Science, Department of / Technical Reports (Computing Science)

Author(s) / Creator(s)
- Pei, Yaling
- Zaiane, Osmar

We present a distribution-based and transformation-based approach to synthetic data generation and demonstrate that it is efficient in generating different types of multi-dimensional numerical datasets for data clustering and outlier analysis. We developed a data generating system that systematically creates test datasets from user requirements such as the number of points, the number of clusters, the size, shape, and location of each cluster, and the density level of either cluster data or noise/outliers in a dataset. Two standard probability distributions are considered in data generation: the uniform distribution and the normal distribution. Since outlier detection, especially local outlier detection, is conducted in the context of the clusters of a dataset, our synthetic data generator is suitable for both clustering and outlier analysis. In addition, the data format has been carefully designed so that generated data can be visualized not only by our system but also by popular statistical rendering tools such as statCrunch and statPoint, which display data using standard statistical graphics. To our knowledge, our system is the first synthetic data generation system that systematically generates datasets for evaluating clustering and outlier analysis algorithms. Being an object-oriented system, the current data generator can be easily integrated into other data analysis systems.

TRID-ID TR06-15
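
The distribution-based and transformation-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the function names `make_cluster` and `make_dataset`, the parameter names, the restriction to two dimensions, and the use of rotation as the only transformation are all assumptions made for illustration. Each cluster is drawn from either a normal or a uniform distribution, rotated and translated into place, and uniform background noise is added with the label -1.

```python
import math
import random

def make_cluster(n, center, spread, dist="normal", angle=0.0, rng=random):
    """Draw n 2-D points, rotate them by `angle` radians, and move them to `center`.

    dist="normal":  each coordinate ~ N(0, spread[i]) before the transformation.
    dist="uniform": each coordinate ~ U(-spread[i], spread[i]) before the transformation.
    (Hypothetical helper; not taken from the report.)
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    points = []
    for _ in range(n):
        if dist == "normal":
            x, y = rng.gauss(0.0, spread[0]), rng.gauss(0.0, spread[1])
        elif dist == "uniform":
            x = rng.uniform(-spread[0], spread[0])
            y = rng.uniform(-spread[1], spread[1])
        else:
            raise ValueError(f"unknown distribution: {dist}")
        # Transformation step: rotate about the origin, then translate.
        points.append((center[0] + x * cos_a - y * sin_a,
                       center[1] + x * sin_a + y * cos_a))
    return points

def make_dataset(cluster_specs, n_noise, bounds, seed=0):
    """Return rows (x, y, label): clusters are labeled 0..k-1, noise is labeled -1."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for label, spec in enumerate(cluster_specs):
        for x, y in make_cluster(rng=rng, **spec):
            rows.append((x, y, label))
    (xmin, xmax), (ymin, ymax) = bounds
    for _ in range(n_noise):
        # Noise/outliers: uniform over the whole bounding box.
        rows.append((rng.uniform(xmin, xmax), rng.uniform(ymin, ymax), -1))
    return rows

specs = [
    dict(n=50, center=(0.0, 0.0), spread=(1.0, 1.0)),
    dict(n=30, center=(10.0, 10.0), spread=(2.0, 0.5),
         dist="uniform", angle=math.pi / 4),
]
rows = make_dataset(specs, n_noise=20, bounds=((-5.0, 15.0), (-5.0, 15.0)), seed=1)
```

Varying `n`, `spread`, and `n_noise` in this sketch corresponds loosely to the user requirements listed in the abstract (number of points, cluster size and shape, and noise density); keeping the label column makes it possible to score a clustering or outlier detection algorithm against the known ground truth.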
Date created

2006
Subjects / Keywords
- Synthetic data generation, cluster
Type of Item

Report
DOI

https://doi.org/10.7939/R3B23S
License

Attribution 3.0 International

Language
- English