
Advantage of Integration in Big Data: Feature Generation in Multi-Relational Databases for Imbalanced Learning

  • Author / Creator
    Ahmed, Farrukh
  • Abstract
    Most data mining and machine learning techniques rely on a single flat table and assume balanced training data. However, most real-world applications involve databases with multiple tables and imbalanced data. The problem is further complicated in the realm of Big Data, where related information is spread over different data repositories. This work focuses on the automatic construction of a mining table by aggregating information from multiple local tables, and from additional data sources treated as external tables, in a multi-relational database. Our work extends data aggregation techniques by exploring paths in which a single table is traversed multiple times. Existing techniques either do not generate the attributes that exist on such paths or do not generate them efficiently, even though these paths contain useful past information. Our framework, Generating Attributes with Rolled Paths (GARP), also prevents leakage of class information by avoiding features that are built after the class label is known. While generating new attributes, our system discovers patterns that provide useful insights for decision making. Experiments are performed on a transactional dataset from a U.S. consumer electronics retailer to predict product returns and to identify the reasons behind those returns. In addition, we augment the retail dataset with Supplier information and Reviews to show the value of data integration. The dataset exhibits class imbalance, since product returns represent only 10% of the complete dataset. The results show that our technique improves classification accuracy and discovers new knowledge even in the presence of class imbalance. Our scalability analysis shows that the approach handles an increasing data load in a linear fashion.
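
The idea in the abstract of a path that traverses the same table more than once, while avoiding class leakage, can be illustrated with a small sketch. This is not the GARP implementation from the thesis: the function name `add_past_return_features`, the column names (`customer_id`, `date`, `returned`), and the toy rows are all assumptions made for illustration. The sketch walks Transaction → Customer → Transaction and, for each purchase, counts only that customer's strictly earlier purchases and returns. It also assumes, as a simplification, that the outcome of an earlier purchase is already known when the later purchase is made.

```python
from collections import defaultdict


def add_past_return_features(transactions):
    """Return copies of the rows, each extended with counts of the same
    customer's earlier purchases and earlier returns (hypothetical features)."""
    purchases_so_far = defaultdict(int)
    returns_so_far = defaultdict(int)
    enriched = []
    # Visit rows in time order so each row sees only earlier history.
    # ISO-formatted date strings sort correctly as plain strings.
    for row in sorted(transactions, key=lambda r: r["date"]):
        cid = row["customer_id"]
        new_row = dict(row)
        # Read the counters before updating them, so the current row's
        # own class label never leaks into its features.
        new_row["past_purchases"] = purchases_so_far[cid]
        new_row["past_returns"] = returns_so_far[cid]
        enriched.append(new_row)
        purchases_so_far[cid] += 1
        returns_so_far[cid] += row["returned"]
    return enriched


# Toy rows invented for this example; they are not from the thesis dataset.
toy = [
    {"customer_id": 1, "date": "2016-01-05", "returned": 1},
    {"customer_id": 2, "date": "2016-01-20", "returned": 0},
    {"customer_id": 1, "date": "2016-02-10", "returned": 0},
    {"customer_id": 2, "date": "2016-02-25", "returned": 0},
    {"customer_id": 1, "date": "2016-03-15", "returned": 1},
]
features = add_past_return_features(toy)
```

Each output row keeps its original fields and gains `past_purchases` and `past_returns`. Because the counters are updated only after the features are written, a row's own `returned` value contributes to later rows but never to its own features.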

  • Subjects / Keywords
  • Graduation date
    Fall 2016
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/R30G3H832
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.