Permanent link (DOI): https://doi.org/10.7939/R3QJ78C0S

Communities

This file is in the following communities:

Graduate Studies and Research, Faculty of

Collections

This file is in the following collections:

Theses and Dissertations

Probabilistic Graphical Model for Entity Resolution (Open Access)

Descriptions

Subject/Keyword
Record Linkage
Probabilistic Graphical Model
Entity Resolution
Bayesian Network
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
He, Ziyue
Supervisor and department
Pedrycz, Witold (Electrical and Computer Engineering)
Examining committee member and department
Kuru, Ergun (Civil and Environmental Engineering)
Pedrycz, Witold (Electrical and Computer Engineering)
Reformat, Marek (Electrical and Computer Engineering)
Department
Department of Electrical and Computer Engineering
Specialization
Software Engineering and Intelligent Systems
Date accepted
2017-09-14T10:50:42Z
Graduation date
2017-11:Fall 2017
Degree
Master of Science
Degree level
Master's
Abstract
The task of entity resolution, also known as record linkage, is to identify which records across several datasets refer to the same entity and to link them together. Record linkage is widely used in fields such as business, healthcare, and criminal investigation, and many techniques have been developed for it. Accuracy and efficiency are the two most important factors when assessing the quality of a record linkage approach. However, finding exactly matching records remains a major challenge because real-world data always contain several types of noise. These errors may be phonetic or typographical, or may result from optical character recognition (OCR). Over the years, many comparison methods have been proposed to quantify the level of similarity (similarity score) between identity fields in pairs of records. Nevertheless, selecting proper comparison methods, appropriate identity fields, and suitable classifiers, and determining the threshold values, remains a challenging problem.
The objective of this dissertation is to design and analyze a probabilistic graphical model (PGM) for record linkage. In this study, a Bayesian network, which is an example of a PGM, is used to calculate the probability that a record pair is a match, in order to decide whether the records can be linked. Several comparison methods and fields are considered for this model. For each combination of a comparison method and a field, a similarity score is obtained. With these scores, along with two predefined thresholds, each record pair is classified as a match, a non-match, or a probable match that needs closer inspection (clerical review). Not every comparison method or field is equally relevant in practice. Therefore, to describe the roles of the selected comparison methods and fields, weights are added to the Bayesian network. These weights are optimized beforehand by a modified supervised gradient descent learning scheme. Experiments are performed on synthetic datasets with different levels of noise. The experimental studies show that the proposed record linkage model can calculate the matching probabilities of records that could potentially be matched accurately and efficiently. Furthermore, the proposed model offers insight into which comparison methods and fields are more significant for correct record linkage.
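The two-threshold decision described in the abstract can be illustrated with a minimal sketch. This is not the thesis's Bayesian network: it swaps in a plain weighted average of per-field similarity scores, and the field names, weights, threshold values, and the use of `difflib.SequenceMatcher` as the comparison method are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights. The thesis learns its weights with a modified
# gradient descent scheme; these values are made up for illustration.
WEIGHTS = {"given_name": 0.3, "surname": 0.5, "city": 0.2}

# Hypothetical lower and upper decision thresholds.
LOWER, UPPER = 0.60, 0.85


def similarity(a: str, b: str) -> float:
    """One possible comparison method: a case-insensitive ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-field similarity scores (a stand-in for the
    matching probability that the Bayesian network would compute)."""
    total = sum(weights.values())
    weighted = sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total


def classify(score: float, lower: float = LOWER, upper: float = UPPER) -> str:
    """Map a score to one of the three outcomes named in the abstract."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "clerical review"


if __name__ == "__main__":
    a = {"given_name": "Jonathan", "surname": "Smith", "city": "Edmonton"}
    b = {"given_name": "Jonathon", "surname": "Smith", "city": "Edmonton"}
    score = match_score(a, b)
    print(round(score, 3), classify(score))
```

Scores between the two thresholds fall into the clerical review band, so widening or narrowing the gap between `LOWER` and `UPPER` trades manual review effort against the risk of automatic misclassification.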
Language
English
DOI
doi:10.7939/R3QJ78C0S
Rights
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

File Details

Date Uploaded
Date Modified
2017-09-14T16:50:43.026+00:00
Characterization
File format: pdf (PDF/A)
Mime type: application/pdf
File size: 2104802
Last modified: 2017:11:08 16:47:22-07:00
Filename: He_Ziyue_201709_MSc.pdf
Original checksum: 33a7d516b0547c52da1429ef50674340