Improving Breast Cancer Biomarker Predictors from Morphology via Corresponding Gaussian Processes

  • Author / Creator
    Hosseini Akbarnejad, Amir Hossein
  • Biomarkers for cancer are tests performed on tumoral tissue that extract information from genes (DNA, deoxyribonucleic acid), products of genes (RNA, ribonucleic acid), and proteins. The information obtained from biomarkers (abnormal amount, structural defect, etc.) is the basis for breast cancer diagnosis and treatment. Routinely performed biomarker tests such as IHC (immunohistochemistry, which reveals protein expression) and FISH (fluorescence in situ hybridization, which reveals DNA expression) are time-consuming and expensive, and are not available in many regions of the world. With the availability of digital microscopy images, several attempts have been made to apply machine learning to predict biomarker information merely from morphology (i.e., from H&E-stained histopathology images). Doing so accurately, if achievable, could address the cost, turnaround-time, and availability issues of the biomarker tests.

    For this task, current machine learning methods have low prediction performance (an AUC, area under the curve, of around 80) because large datasets are unavailable. To tackle this issue, we created an in-house dataset called IHC4BC containing more than 180,000 images. Thanks to the large dataset, we showed that standard machine learning methods can achieve AUCs of around 90. Moreover, we showed that weakly-supervised training with patient-level labels is not successful, and that the patch-level labels acquired in our proposed IHC4BC dataset were essential to achieving high prediction performance.

    Despite the good prediction performance of the obtained classifiers, our experiments showed that high patch-level prediction performance does not mean that a method has successfully localized all relevant tissue regions. Here "relevant" refers to tissue regions in which a particular gene is over-expressed, e.g., HER2 (the Human Epidermal Growth Factor Receptor 2), which is found in approximately 20% of all breast cancers and is associated with a poor outcome if not identified and properly treated. Given this limitation of methods in localizing relevant tissue regions, we manually marked nearly 900K HER2-positive points on the HER2 subset of the IHC4BC dataset. These manually-marked points were used to train a strongly-supervised classifier with pixel-level labels and a state-of-the-art localization method. In our analysis, automatic localization is competitive with pixel-level supervision and, intriguingly, sometimes even works better. Importantly, our analysis motivates the adoption of automatic localization with, e.g., 3K by 3K level labels, especially for heterogeneous biomarkers for which acquiring pixel-level labels is not possible.

    Although the proposed IHC4BC dataset enabled the classifiers to achieve high prediction performance, there are failure cases, and many research questions remain to be answered. To this end, we proposed a method called GPEX (Gaussian Processes for EXplaining ANNs) to interpret artificial neural networks: it provides reliable explanations by performing knowledge distillation between artificial neural networks and GPs (Gaussian processes). Using our proposed GPEX, we obtain Gaussian processes that are equivalent to trained neural networks. We showed that the obtained GPs can provide insight into the underlying mechanism of neural network classifiers trained on publicly available image datasets.

    An important goal is to identify tissue types that are missing from our in-house IHC4BC dataset and to add those images to the dataset. This goal is addressed in the setting known as active learning, where a pool of unlabeled instances is available and an active learner picks some instances from the pool and asks for their labels. The active learner is supposed to pick the pool instances that are the most beneficial to a predictor. We applied GPEX to the HER2 subset of our IHC4BC dataset and showed that, in Bayesian active learning, the GPs obtained by our proposed GPEX are a better choice than the commonly-used dropout and can improve a state-of-the-art Bayesian active learner.
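
    The pool-based selection step described above can be illustrated with a minimal sketch. It is not the thesis's active learner and does not implement GPEX; it assumes a generic maximum-entropy acquisition rule, and the names `predictive_entropy`, `select_queries`, and the probability values are hypothetical. In practice the predictive probabilities would come from whatever uncertainty model is in use (for example, a GP posterior or dropout sampling).

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predictive class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs, k):
    """Return indices of the k pool instances with the highest entropy.

    Assumption: the most uncertain instances are treated as the most
    beneficial ones to label (max-entropy acquisition).
    """
    scores = [predictive_entropy(p) for p in pool_probs]
    ranked = sorted(range(len(pool_probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical (negative, positive) probabilities for four unlabeled patches.
pool_probs = [
    (0.98, 0.02),  # confident negative
    (0.55, 0.45),  # uncertain
    (0.10, 0.90),  # fairly confident positive
    (0.50, 0.50),  # maximally uncertain
]

queries = select_queries(pool_probs, k=2)
print(queries)  # indices of the patches whose labels would be requested
```

    With these made-up probabilities, the two patches closest to a 50/50 prediction are selected for labeling; a different uncertainty model changes only how `pool_probs` is produced.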

  • Subjects / Keywords
  • Graduation date
    Fall 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-jhz8-xe71
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.