Towards an Accurate, Robust, and Scalable Named Entity Disambiguation System

  • Author / Creator
    Guo, Zhaochen
  • Abstract
    Knowledge bases (KBs), repositories consisting of entities, facts about entities, and relations between entities, are a vital component of many tasks in artificial intelligence and natural language processing, such as semantic search and question answering. Named Entity Disambiguation (NED), the task of disambiguating mentions of named entities in a textual document by linking them to the actual entities in a KB, enables expanding or correcting the KB with facts extracted from documents, a task called Knowledge Base Population. This thesis focuses on the NED task, with the goal of building an accurate, robust, and scalable NED system. We first propose a graph-based approach that collectively disambiguates mentions of entities in a given document, under the assumption that entities mentioned in a document are semantically related under a single topic. Our approach uses a carefully curated disambiguation graph built from a KB and applies personalized random walks on the graph to compute semantic representations of entities, which are used to measure semantic relatedness and disambiguate named entities. We then improve the robustness of our NED approach with a supervised learning-to-rank algorithm using publicly available datasets. We find that the public benchmarks, which are drawn mainly from news articles, are biased towards well-known entities and are not representative enough to evaluate the robustness of an NED approach. We therefore develop a framework for deriving new benchmarks and construct two benchmarks with varying levels of disambiguation difficulty from two large corpora (Wikipedia and ClueWeb) for the evaluation of robustness. Finally, to address the scalability of our NED approach, we explore various features from entity graphs, contextual texts, and document corpora that can be efficiently pre-computed offline.
    Instead of performing random walks on graphs constructed online, we use a set of selected landmark nodes from entity graphs to compute the semantic representations of entities. We also explore features derived from the documents that describe entities and from the categories associated with them. Because all of these features are pre-computed offline, our approach reduces the computing and memory resources required, which improves efficiency and allows the NED system to scale out. The evaluation shows that our approach is competitive and efficient compared with previous NED approaches.
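
    The idea of using personalized random walks to obtain semantic representations of entities can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy graph, the restart probability, the use of cosine similarity, and the names personalized_random_walk, cosine, and disambiguate are all assumptions introduced here for illustration only.

```python
import math

# Hypothetical toy entity graph (undirected adjacency lists), for illustration only.
GRAPH = {
    "Michael Jordan (basketball)": ["Chicago Bulls", "NBA"],
    "Chicago Bulls": ["Michael Jordan (basketball)", "NBA"],
    "NBA": ["Michael Jordan (basketball)", "Chicago Bulls", "United States"],
    "United States": ["NBA", "UC Berkeley"],
    "UC Berkeley": ["United States", "Michael Jordan (scientist)", "Machine learning"],
    "Michael Jordan (scientist)": ["UC Berkeley", "Machine learning"],
    "Machine learning": ["Michael Jordan (scientist)", "UC Berkeley"],
}


def personalized_random_walk(graph, seed, restart=0.15, iterations=50):
    """Approximate the visiting distribution of a random walk that
    jumps back to `seed` with probability `restart` at every step."""
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            if not neighbors:
                # Dangling node: send its walking mass back to the seed.
                nxt[seed] += (1 - restart) * mass
                continue
            share = (1 - restart) * mass / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        # Total mass is 1, so the restart step contributes exactly `restart`.
        nxt[seed] += restart
        scores = nxt
    return scores


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def disambiguate(graph, candidates, context_entity):
    """Pick the candidate whose walk vector is closest to the context entity's."""
    context_vec = personalized_random_walk(graph, context_entity)
    return max(
        candidates,
        key=lambda c: cosine(personalized_random_walk(graph, c), context_vec),
    )


if __name__ == "__main__":
    candidates = ["Michael Jordan (basketball)", "Michael Jordan (scientist)"]
    print(disambiguate(GRAPH, candidates, "Chicago Bulls"))
```

    In this sketch, a mention with two candidate entities is resolved by comparing each candidate's walk distribution with that of an unambiguous entity from the same document. A full system would combine evidence from all mentions in the document rather than from a single context entity.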

  • Graduation date
    Fall 2018
  • Degree
    Doctor of Philosophy
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.