This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Advantage of Integration in Big Data: Feature Generation in Multi-Relational Databases for Imbalanced Learning
Open Access


Keywords: Feature Generation; Big Data; Data Integration; Imbalanced Learning
Degree grantor: University of Alberta
Author or creator: Ahmed, Farrukh
Supervisor and department: Zaiane, Osmar (Computing Science)
Examining committee member and department: Davood Rafiei (Computing Science); Reidar Hagtvedt (Business)
Department: Department of Computing Science
Graduation date: 2016-06:Fall 2016
Degree: Master of Science
Most data mining and machine learning techniques operate on a single flat table and assume balanced training data. However, most real-world applications involve databases with multiple tables and imbalanced data. The problem is further complicated in the realm of Big Data, where related information is spread across different data repositories. This work focuses on the automatic construction of a mining table by aggregating information from multiple local tables, together with additional data sources treated as external tables, in a multi-relational database. Our work extends data aggregation techniques by exploring paths in which a single table is traversed multiple times. Existing techniques either do not generate attributes that lie on such paths or do not generate them efficiently, even though these paths contain useful past information. Our framework, Generating Attributes with Rolled Paths (GARP), also prevents leakage of class information by avoiding features that are built after the class label is known. While generating new attributes, our system discovers patterns that provide useful insights for decision making. Experiments are performed on a transactional dataset from a U.S. consumer electronics retailer to predict product returns and identify the reasons behind those returns. In addition, we augment the retail dataset with supplier information and reviews to show the value of data integration. The dataset is imbalanced, since product returns represent only 10% of the complete dataset. The results show that our technique improves classification accuracy and discovers new knowledge even in the presence of class imbalance. Our scalability analysis shows that the approach scales linearly as the data load increases.
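The idea of traversing one table more than once while avoiding class leakage can be illustrated with a minimal sketch. This is not the GARP implementation described in the thesis; the toy `transactions` rows, the column layout, and the `build_mining_table` helper are all hypothetical, and the sketch assumes, as a simplification, that the outcome of an earlier purchase is already known when a later purchase is made.

```python
from datetime import date

# Hypothetical toy rows: (transaction_id, customer_id, purchase_date, returned)
transactions = [
    (1, "c1", date(2015, 1, 5), True),
    (2, "c1", date(2015, 2, 1), False),
    (3, "c1", date(2015, 3, 9), False),
    (4, "c2", date(2015, 1, 20), False),
    (5, "c2", date(2015, 4, 2), True),
]


def build_mining_table(rows):
    """Return one flat row per transaction, with features aggregated from
    the same customer's strictly earlier transactions (a self-join)."""
    mining_table = []
    for tid, customer, when, returned in rows:
        # Traverse the transactions table a second time, keeping only rows
        # dated before the target row so that no later outcome leaks in.
        past = [r for r in rows if r[1] == customer and r[2] < when]
        mining_table.append({
            "transaction_id": tid,
            "past_purchases": len(past),
            "past_returns": sum(1 for r in past if r[3]),
            "label_returned": returned,  # class label, never used as a feature
        })
    return mining_table


mining_table = build_mining_table(transactions)
for row in mining_table:
    print(row)
```

Each output row combines the target transaction with aggregates over the customer's history, which is the kind of "past information" a path through the same table can expose. The date filter is what keeps features computed after the label from entering the table.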
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
File Details

File format: pdf (PDF/A)
Mime type: application/pdf
File size: 1429030
Last modified: 2016-11-16 13:05:37-07:00
Filename: Ahmed_Farrukh_201608_MSc.pdf
Original checksum: 7a3abfb77254f6a8b69e551e8daa89ea
Well formed: true
Valid: true
Page count: 83