Deep Learning-based Segmentation for Complex Scene Understanding

  • Author / Creator
    Ji, Wei
  • Deep learning-based segmentation plays a crucial role in computer and robot vision. Traditional approaches have predominantly relied on RGB (i.e., color) imagery, given its widespread availability. However, the innate limitations of color imagery, such as cluttered backgrounds and poor lighting, significantly degrade the performance of existing segmentation methods in complex visual scenes. This thesis seeks to advance deep learning-based segmentation for complex scene understanding by investigating additional imaging modalities.

    This thesis pursues two avenues: 1) incorporating depth data to capture the 3D spatial layout of scenes, and 2) using thermal infrared imagery to enhance vision under adverse lighting conditions. In the first avenue, we concentrate on RGB-Depth segmentation and propose three novel strategies that improve segmentation efficacy by optimizing three key aspects of deep learning models: network input, network architecture, and network supervision. These strategies involve calibrating the inherent bias in depth inputs for better scene-layout depiction, developing advanced network architectures for improved multimodal information fusion and contextual comprehension, and harnessing depth-map geometry to enable unsupervised RGB-D segmentation, thus reducing reliance on extensive human annotations. In the second avenue, we delve into RGB-Thermal (multispectral) segmentation, a relatively under-explored territory. We introduce the SemanticRT dataset, a large-scale resource for segmenting images under varied illumination conditions, together with an explicit complement modeling (ECM) framework that enhances modality-specific cue utilization and cross-modal feature fusion (an illustrative sketch of such fusion follows the abstract). Additionally, we pioneer RGB-Thermal segmentation in the video domain by presenting the first multispectral video semantic segmentation benchmark dataset, MVSeg, and developing an efficient MVNet baseline framework that jointly learns semantic representations from multispectral and temporal contexts.

    Extensive evaluations across ten segmentation datasets demonstrate that the proposed methodologies significantly outperform existing state-of-the-art solutions in handling challenging scenarios, marking clear advances in deep learning-based segmentation. This thesis also discusses the benefits and limitations of a recent foundation model, the Segment Anything Model, and outlines compelling open issues and future research avenues in the field. Importantly, we advocate for open access, making our source code, models, and datasets publicly available to foster reproducibility and encourage collaborative research.
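
    To make the cross-modal fusion idea mentioned above concrete, the sketch below shows one common pattern: a two-stream network where RGB features are fused with a second modality (depth or thermal) through a learned per-pixel gate, so unreliable auxiliary cues can be suppressed. This is a minimal illustration only; the class names, channel sizes, and gating design are assumptions for exposition and are not the thesis's actual ECM or MVNet code.

    # Hypothetical sketch: two-stream RGB + auxiliary-modality segmentation
    # with gated cross-modal fusion. Illustrative only, not the thesis's code.
    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        """Fuse RGB and auxiliary-modality features via a learned gate."""
        def __init__(self, channels: int):
            super().__init__()
            # The gate predicts, per pixel, how much to trust the auxiliary cue.
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
            g = self.gate(torch.cat([rgb_feat, aux_feat], dim=1))
            return rgb_feat + g * aux_feat  # down-weight unreliable auxiliary features

    class TwoStreamSegNet(nn.Module):
        """Minimal two-stream segmentation network (RGB + depth/thermal)."""
        def __init__(self, num_classes: int = 19, channels: int = 64):
            super().__init__()
            def stem(in_ch: int) -> nn.Sequential:
                return nn.Sequential(
                    nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                )
            self.rgb_stream = stem(3)   # 3-channel color image
            self.aux_stream = stem(1)   # single-channel depth or thermal map
            self.fuse = FusionBlock(channels)
            self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

        def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
            fused = self.fuse(self.rgb_stream(rgb), self.aux_stream(aux))
            return self.head(fused)  # per-pixel class logits

    if __name__ == "__main__":
        net = TwoStreamSegNet()
        rgb = torch.randn(1, 3, 128, 160)   # dummy RGB frame
        aux = torch.randn(1, 1, 128, 160)   # dummy depth/thermal frame
        print(net(rgb, aux).shape)          # torch.Size([1, 19, 128, 160])

    The gated addition keeps the RGB stream as the primary signal, which mirrors the general motivation stated in the abstract: auxiliary modalities should complement color imagery where it fails (poor lighting, clutter) rather than replace it.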

  • Subjects / Keywords
  • Graduation date
    Fall 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-4v6d-kj50
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.