Large-scale Document Understanding with Knowledge Graphs for Medical Applications

  • Author / Creator
    Costello, Jeremy
  • We introduce the field of natural language processing, outlining the benefits and drawbacks of rule-based versus statistical methods. We present knowledge graphs as a way to combine the explainability of rule-based methods with the power of statistical methods, large language models in particular. The accuracy of natural language processing methods is paramount in sensitive fields such as biomedicine. We aim to create a knowledge graph to help practitioners, caretakers, and patients affected by neurodevelopmental disorders.
    We provide background on knowledge graphs, topic modeling, and reinforcement learning. We describe what knowledge graphs are, how they are created, and the natural language processing methods used to extract data from text to populate them. We give a short history of topic modeling, followed by an outline of latent Dirichlet allocation, dynamic topic models, topic model evaluation, and recent advances in neural topic modeling. We explain what reinforcement learning is and outline the main approaches to it.
    We develop a pipeline for creating a knowledge graph on neurodevelopmental disorders. We scrape data from both professional academic sources and non-professional webpages, including pages on finances and services for caretakers and patients affected by neurodevelopmental disorders. We take input from practitioners, caretakers, and patients throughout the knowledge graph creation process to generate a graph that is as useful as possible for non-professionals, in contrast to many existing medical knowledge graphs that incorporate only academic sources.
    To improve the topic modeling stage of our knowledge graph creation pipeline, we develop a new topic model using reinforcement learning. We make further improvements to the topic model, including modernizing the neural network architecture, weighting the ELBO loss, and using contextual embeddings (an illustrative sketch of a weighted ELBO follows the abstract). Our unsupervised model outperforms all other unsupervised models and performs on par with or better than most models that use supervised labeling. We conduct an ablation study to determine which changes to our model are the most important.
    We then extract triples directly from text using large language models. With the assistance of volunteers, we create two new data sets about Fragile X syndrome: one for named-entity recognition and one for relation extraction. We compare a model trained on our Fragile X data set to a model trained on a less specific data set, and identify strengths and weaknesses of both (an illustrative triple-extraction sketch also follows the abstract). Our method is likely outdated due to the rapid pace of advancements in large language models.
    We give a short concluding statement summarizing what we have done, and provide some brief thoughts on the future of natural language processing for biomedical applications.
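
    A minimal sketch of the weighted-ELBO idea, assuming a ProdLDA-style variational autoencoder with a Gaussian approximate posterior; the function name, tensor names, and the kl_weight parameter are illustrative assumptions, not the thesis implementation.

    # Hypothetical weighted ELBO for a neural topic model (ProdLDA-style VAE).
    # All names here are placeholders; the weighting factor rebalances the
    # reconstruction and KL terms instead of summing them equally.
    import torch.nn.functional as F

    def weighted_elbo_loss(bow, recon_logits, mu, logvar, kl_weight=0.5):
        # bow:          (batch, vocab) bag-of-words counts per document
        # recon_logits: (batch, vocab) decoder output (unnormalized log word scores)
        # mu, logvar:   (batch, n_topics) parameters of the approximate posterior
        # Negative log-likelihood of the observed word counts under the decoder.
        recon = -(bow * F.log_softmax(recon_logits, dim=-1)).sum(dim=-1)
        # KL divergence between the Gaussian posterior and a standard normal prior.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return (recon + kl_weight * kl).mean()

    Setting kl_weight below 1 is one common way to keep the approximate posterior from collapsing to the prior early in training, which is a typical motivation for weighting the ELBO terms.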
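
    An illustrative sketch of combining named-entity recognition and relation extraction outputs into knowledge graph triples; the model path and the relation_classifier callable are hypothetical placeholders, not the models trained on the Fragile X data sets.

    # Placeholder pipeline: find entities, then ask a relation classifier whether
    # a relation holds between each pair, keeping (head, relation, tail) triples.
    from transformers import pipeline

    ner = pipeline("token-classification",
                   model="path/to/ner-model",  # hypothetical model path
                   aggregation_strategy="simple")

    def extract_triples(sentence, relation_classifier):
        entities = ner(sentence)
        triples = []
        for i, head in enumerate(entities):
            for tail in entities[i + 1:]:
                relation = relation_classifier(sentence, head["word"], tail["word"])
                if relation is not None:
                    triples.append((head["word"], relation, tail["word"]))
        return triples

    Triples produced this way could then be merged into the knowledge graph alongside those drawn from academic and non-professional sources.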

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-b549-9m42
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.