
Advancing Automated Depression Diagnosis: Multimodal Analysis and a Novel Clinical Interview Corpus with Guidelines for Reproducibility and Generalizability

  • Author / Creator
    Mao, Kaining
  • Abstract
    Depression is a major global public health issue that is challenging to diagnose and treat at an early clinical stage because its pathogenic mechanisms are poorly understood. Traditional diagnosis relies heavily on physicians' experience and is subject to bias. With the advancement of smart devices and artificial intelligence, understanding how depression is associated with daily behaviors can support early-stage diagnosis and reduce the likelihood of clinical mistakes and physician bias. In this thesis, the author proposes an attention-based multimodal speech and text representation for depression prediction using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset.

    First, the author conducted a review of studies from the past decade that used speech, text, and facial expression analysis to detect depression. For each study, the review records the number of participants, the techniques used to assess clinical outcomes, the speech-eliciting tasks, the machine learning algorithms and metrics, and other important findings. The author also compiled a database containing the query results and an overview of how different features are used to detect depression.
    
    Furthermore, the author's model is trained to estimate the depression severity of participants using acoustic and semantic features. For the audio modality, the author uses the COVAREP features provided with the dataset and employs a Bi-LSTM followed by a time-distributed CNN. For the text modality, the author uses GloVe word embeddings and feeds them into a Bi-LSTM network. Both the audio and text models perform well on the depression severity estimation task over five classes (healthy, mild, moderate, moderately severe, and severe): the audio model achieves a best sequence-level $F_1$ score of 0.9870 and a patient-level $F_1$ score of 0.9074, while the text model achieves a sequence-level $F_1$ score of 0.9709 and a patient-level $F_1$ score of 0.9245. Results are similar for the multimodal fused model, whose design is sketched below, with a highest $F_1$ score of 0.9580 on the patient-level five-class depression detection task.
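    As a concrete illustration of the two-branch design described above, here is a minimal sketch in PyTorch; the layer sizes, attention-head count, pooling, and fusion details are assumptions made for illustration and are not taken from the thesis.

    ```python
    # Minimal sketch of the described two-branch architecture (PyTorch).
    # Feature dimensions, layer sizes, and the fusion mechanism are
    # illustrative assumptions, not the thesis's exact configuration.
    import torch
    import torch.nn as nn

    class AudioBranch(nn.Module):
        """Bi-LSTM over COVAREP frames, followed by a convolution applied
        along the time axis (standing in for the time-distributed CNN)."""
        def __init__(self, n_covarep=74, hidden=128):
            super().__init__()
            self.bilstm = nn.LSTM(n_covarep, hidden, batch_first=True,
                                  bidirectional=True)
            self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)

        def forward(self, x):                 # x: (batch, frames, n_covarep)
            h, _ = self.bilstm(x)             # (batch, frames, 2*hidden)
            h = self.conv(h.transpose(1, 2))  # (batch, hidden, frames)
            return h.transpose(1, 2)          # (batch, frames, hidden)

    class TextBranch(nn.Module):
        """Pre-computed GloVe embeddings fed into a Bi-LSTM."""
        def __init__(self, emb_dim=300, hidden=128):
            super().__init__()
            self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.proj = nn.Linear(2 * hidden, hidden)

        def forward(self, x):                 # x: (batch, tokens, emb_dim)
            h, _ = self.bilstm(x)
            return self.proj(h)               # (batch, tokens, hidden)

    class FusionModel(nn.Module):
        """Attention-based fusion: text states attend over audio states,
        then a pooled representation is classified into five classes."""
        def __init__(self, hidden=128, n_classes=5):
            super().__init__()
            self.audio = AudioBranch(hidden=hidden)
            self.text = TextBranch(hidden=hidden)
            self.attn = nn.MultiheadAttention(hidden, num_heads=4,
                                              batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, covarep, glove):
            a = self.audio(covarep)
            t = self.text(glove)
            fused, _ = self.attn(t, a, a)     # queries: text; keys/values: audio
            return self.head(fused.mean(dim=1))  # sequence-level class logits
    ```

    With inputs of shape (batch, frames, 74) for COVAREP and (batch, tokens, 300) for GloVe, such a model returns five-class severity logits per sequence; sequence-level predictions can then be aggregated per participant (for example, by majority vote) to obtain patient-level predictions.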
    
    In addition, the author presents a novel multimodal corpus of interviews with clinically depressed patients, gathered directly from a psychiatric hospital. The dataset contains 113 interview recordings from 52 healthy controls and 61 depressed patients, and each sample is annotated by experienced physicians with a binary depressed-versus-healthy label and a MADRS (Montgomery-Åsberg Depression Rating Scale) score. The author built baseline models to detect the presence of depression and predict its level, and the models' decision-making process is investigated and illustrated; the annotation structure is sketched below.
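    The per-sample annotation can be pictured as a simple record. A minimal sketch follows; the field names are hypothetical, chosen only to illustrate the label structure (binary diagnosis plus a MADRS severity score), not the corpus's actual schema.

    ```python
    # Hypothetical record for one annotated interview in the corpus.
    # All field names are illustrative assumptions, not the real schema.
    from dataclasses import dataclass

    @dataclass
    class InterviewRecord:
        recording_id: str  # one of the 113 interview recordings
        audio_path: str    # speech recording (acoustic modality)
        transcript: str    # interview transcript (semantic modality)
        depressed: bool    # physician-assigned binary label
        madrs: int         # MADRS severity score (0-60 scale)
    ```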
    
    In summary, the author conducts a review of studies that use speech, text, and facial expression analysis to detect depression, and provides guidelines for collecting data and training machine learning models to ensure reproducibility and generalizability across different contexts. The author also proposes an attention-based multimodal representation, integrating the speech and text modalities, for predicting depression, and presents a novel multimodal corpus of clinical interviews focused on depression. This work contributes to the advancement of automated depression diagnosis and treatment, which is critical in addressing the global public health issue of depression.
    

  • Subjects / Keywords
  • Graduation date
    Fall 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-kdwf-zc25
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.