
Investigating Feature Importance In Educational Data, Towards Handling Data Missingness in Classification Tasks

  • Author / Creator
    Obinwanne, Ndidi M
  • The problem of missing data is unavoidable in many research fields, especially in education, where data can be missing for justifiable reasons. Missing data causes bias in analysis, and traditional methods like complete case analysis and single imputation are suboptimal yet typically used to address the problem. These methods emphasize achieving complete datasets before attempting classification tasks. The consequences are a reduction in sample size, loss of statistical power, and loss of representation in the data. In this work, we investigate a simple approach to missing data and build upon the multiple imputation method. This simple approach avoids imputation and instead concatenates information about which features have missing values in an education dataset. This concatenation approach deprioritizes the estimation of values in order to provide an alternative to data completion. As a first attempt to demonstrate that this approach is feasible, we investigated how these methods for handling missing data affect two neural network architectures' ability to predict time to completion. To support this task, we first perform feature investigation using Structural Equation Modeling (SEM) to understand which features contain meaningful information. Results from this analysis showed that features containing data about student demography, high school performance details, English language skills, and university program details were important in understanding and explaining students' time to completion. We used SEM-identified features as input to a prediction task implemented with versions of the data that relied on current simple imputation techniques (zero imputation [ZNet], mean imputation [Mean], and iterative imputation [Iterative]) and one that used the non-imputation technique concatenation (Cat). We trained two neural network architectures - SmallNets and MediumNets - and compared model performance across techniques.
The results show that the non-imputation technique Cat achieved performance comparable to or higher than that achieved by each of the three imputation techniques. Statistical tests on the SmallNets and MediumNets architectures showed that differences existed between Cat and each of Mean and Iterative at different missingness levels. Cat outperformed Mean and Iterative at missingness levels of 10% and 80% in the SmallNets architecture. In the MediumNets architecture, it outperformed Mean and Iterative when missingness was at 40% and 80%. This indicates that Cat outperformed the imputation techniques Mean and Iterative at increasing missingness levels, and can perform better when used with a larger network. Our work provides a case study of the analysis and prediction of learner success even when data contains missingness, and it highlights that the simple concatenation approach might be sufficient for classification tasks with missing data.
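The contrast between the imputation route and the concatenation (Cat) route described above can be sketched in a few lines. This is a minimal reconstruction for illustration, not the thesis's actual preprocessing pipeline: it assumes a numeric feature matrix, uses scikit-learn's `SimpleImputer` for the Mean variant, and builds Cat by appending a binary missingness mask to a zero-filled copy of the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (stand-in for the SEM-selected
# education features described in the abstract).
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Imputation route (the "Mean" variant): estimate missing values,
# keeping the original feature width.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Concatenation route ("Cat"): skip estimation, zero-fill the gaps,
# and append one binary "was missing" indicator column per feature.
mask = np.isnan(X).astype(float)
X_cat = np.concatenate([np.nan_to_num(X, nan=0.0), mask], axis=1)

print(X_mean.shape)  # (3, 3) - same width, values estimated
print(X_cat.shape)   # (3, 6) - original features plus mask columns
```

Either matrix can then be fed to a downstream classifier; Cat doubles the input width, which is consistent with the abstract's observation that it pairs well with a larger network.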

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-9qsn-bj68
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.