Probabilistic Graphical Model for Entity Resolution

  • Author / Creator
  • Entity resolution, also known as record linkage, is the task of identifying records that refer to the same entity across several datasets and linking them together. Record linkage is widely used in fields such as business, healthcare, and criminal investigation, and many techniques have been developed for it. Accuracy and efficiency are the two most important factors in assessing the quality of a record linkage approach. However, finding matching records remains a major challenge because real-world data contain several types of noise. These errors may be phonetic or typographical, or may result from optical character recognition (OCR). Over the years, many comparison methods have been proposed to quantify the similarity (similarity score) between identity fields in pairs of records. Nevertheless, selecting proper comparison methods, appropriate identity fields, and suitable classifiers, and determining threshold values, remain challenging problems. The objective of this dissertation is to design and analyze a probabilistic graphical model (PGM) for record linkage. In this study, a Bayesian network, which is one type of PGM, is used to calculate the probability that a pair of records matches and thus to decide whether the pair can be linked. Several comparison methods and fields are considered in the model. For each combination of a comparison method and a field, a similarity score is obtained. With these scores and two predefined thresholds, each record pair is classified as a match, a non-match, or a probable match that requires closer inspection (clerical review). Not every comparison method or field is equally relevant in practice. Therefore, to describe the roles of the selected comparison methods and fields, weights are added to the Bayesian network. These weights are optimized beforehand by a modified supervised gradient descent learning scheme. Experiments are performed on synthetic datasets with different levels of noise. The experimental studies show that the proposed record linkage model can calculate the matching probabilities of candidate record pairs accurately and efficiently. Furthermore, the proposed model offers insight into which comparison methods and fields are more significant for correct record linkage.
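
The two-threshold decision described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's Bayesian network: it replaces the network with a plain weighted average of per-field similarity scores, uses `difflib.SequenceMatcher` as a stand-in comparison method, and sets the weights by hand rather than learning them by gradient descent. The function names, field names, weights, and threshold values are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; difflib is a stand-in comparison method."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarity scores (assumed combination rule)."""
    total = sum(weights.values())
    weighted = sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total


def classify(score: float, lower: float = 0.6, upper: float = 0.85) -> str:
    """Two-threshold decision: match, non-match, or clerical review."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "clerical review"


# Hypothetical records and hand-picked weights (the thesis learns its weights).
weights = {"name": 2.0, "city": 1.0}
rec_a = {"name": "Jonathan Smith", "city": "Edmonton"}
rec_b = {"name": "Jonathon Smith", "city": "Edmonton"}

print(classify(match_score(rec_a, rec_b, weights)))
```

Scores between the two thresholds are routed to clerical review instead of being forced into a match or non-match decision, which is the role the abstract assigns to the two predefined thresholds.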

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Electrical and Computer Engineering
  • Specialization
    • Software Engineering and Intelligent Systems
  • Supervisor / co-supervisor and their department(s)
    • Pedrycz, Witold (Electrical and Computer Engineering)
  • Examining committee members and their departments
    • Pedrycz, Witold (Electrical and Computer Engineering)
    • Reformat, Marek (Electrical and Computer Engineering)
    • Kuru, Ergun (Civil and Environmental Engineering)