Detecting and correcting typing errors in open-domain knowledge graphs using semantic representation of entities

  • Author / Creator
    Caminhas, Daniel D.
  • Large and accurate Knowledge Graphs (KGs) are often used as a source of structured knowledge in many natural language processing (NLP) tasks, such as question answering, conversational agents, information integration, named entity recognition, and document ranking.

    Various approaches for creating and updating KGs exist, each with its own advantages and disadvantages. Manually curated KGs can be very accurate, but require too much effort and tend to be small and domain-specific. Larger cross-domain KGs can be created automatically from unstructured or semi-structured data, but even the best methods today are error-prone. Given this trade-off between completeness and correctness, researchers have been attempting to refine KGs after they have been constructed, by adding missing knowledge or finding and correcting erroneous information.

    This thesis proposes a fully automatic method for detecting and correcting erroneous type assignments in an open-domain KG, provided that the entities in the KG are mentioned in a text corpus. Our approach creates semantic representations (embeddings) of the entities in the KG that take into account both how the entities are mentioned in the corpus and their properties in the KG itself. These embeddings are then used as features for machine learning classifiers that are trained to distinguish entities of each type.
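
    As a rough illustration of the idea of classifying entities by type from their embeddings, the following Python sketch flags an asserted type when the entity's embedding sits closer to another type's examples. It is a minimal sketch under stated assumptions, not the method used in the thesis: the two-dimensional embeddings, the entity names, the single asserted type per entity, and the nearest-centroid classifier with cosine similarity are all hypothetical stand-ins for embeddings learned from Wikipedia mentions and DBpedia properties and for whichever classifiers the thesis trains.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def type_centroids(embeddings, asserted_types):
    """Average the embeddings of all entities asserted to have each type."""
    groups = defaultdict(list)
    for entity, etype in asserted_types.items():
        groups[etype].append(embeddings[entity])
    return {
        etype: [sum(column) / len(vectors) for column in zip(*vectors)]
        for etype, vectors in groups.items()
    }


def find_type_errors(embeddings, asserted_types):
    """Return (entity, asserted_type, suggested_type) for suspicious pairs.

    A pair is flagged when some other type's centroid is more similar to
    the entity's embedding than the centroid of its asserted type.
    """
    centroids = type_centroids(embeddings, asserted_types)
    errors = []
    for entity, etype in asserted_types.items():
        scores = {t: cosine(embeddings[entity], c) for t, c in centroids.items()}
        best = max(scores, key=scores.get)
        if best != etype:
            errors.append((entity, etype, best))
    return errors


# Hypothetical toy data: made-up 2-D embeddings, one asserted type each.
EMBEDDINGS = {
    "alice": [1.00, 0.10], "bob": [0.90, 0.20],
    "carol": [0.95, 0.00], "dave": [0.85, 0.15],
    "paris": [0.10, 1.00], "rome": [0.20, 0.90],
    "oslo": [0.00, 0.90], "lima": [0.10, 0.95],
}
ASSERTED_TYPES = {
    "alice": "Person", "bob": "Person", "carol": "Person", "dave": "Person",
    "paris": "Place", "rome": "Place", "oslo": "Place",
    "lima": "Person",  # deliberately wrong assignment to be detected
}

print(find_type_errors(EMBEDDINGS, ASSERTED_TYPES))
# [('lima', 'Person', 'Place')]
```

    In this toy setting the mislabeled entity is still recovered even though it contributes to the centroid of its wrong type, because the correctly typed examples dominate the average. A real KG assigns several types per entity, so a realistic variant would score each entity-type pair independently rather than pick a single best type.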

    To test our solution, we use DBpedia as the KG and Wikipedia as the text corpus, and we perform an extensive retrospective evaluation in which almost 15,000 entity-type pairs were verified by humans. Our results reveal several problems in the DBpedia ontology and lead us to conclude that our method significantly outperforms alternative solutions for finding erroneous type assignments in knowledge graphs.

  • Subjects / Keywords
  • Graduation date
    Fall 2019
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.