Annotating Web Tables Using Surface Text Patterns

  • Author / Creator
    Wang, Andong
  • Abstract
    While the World Wide Web has always been treated as an immense source of data, most of the information it provides is unstructured and sometimes ambiguous, which makes it unreliable. However, the Web also contains a relatively large amount of structured data in the form of tables, which are carefully constructed by humans. Unfortunately, each relational table on the Web carries its own "schema". The semantics of the columns and the relationships between them are often ill-defined, which makes machine interpretation of the schema difficult and sometimes impossible. We study the problem of annotating Web tables: given a table and a set of relevant documents, each describing or mentioning the element(s) of a row, the goal is to find surface text patterns that best describe the contexts for each column or combination of columns. The problem is challenging because of the number of potential patterns, the amount of noise in the texts, and the numerous ways rows can be mentioned. We develop a two-stage framework: in the first stage, candidate patterns are generated using sliding windows over the texts; in the second stage, patterns are generalized and redundant patterns are removed. Experiments are conducted to evaluate the quality of the annotations in comparison to human annotations.
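
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`candidate_patterns`, `generalize`, `annotate`), the `<C0>`, `<C1>` column placeholders, whitespace tokenization, single-token cell values, the context-trimming rule used for generalization, and the `min_support` threshold used for pruning are all assumptions made for illustration.

```python
from collections import Counter


def candidate_patterns(row, text, window=6):
    """Stage 1 (sketch): slide a fixed-size token window over the text and
    keep every window that mentions at least one cell of the row.
    Assumes whitespace tokenization and single-token cell values."""
    tokens = text.lower().split()
    # Hypothetical placeholders: cell i of the row becomes "<Ci>".
    placeholders = {str(value).lower(): f"<C{i}>" for i, value in enumerate(row)}
    marked = [placeholders.get(tok, tok) for tok in tokens]
    slots = set(placeholders.values())
    last_start = max(1, len(marked) - window + 1)
    windows = (tuple(marked[s:s + window]) for s in range(last_start))
    return [w for w in windows if slots.intersection(w)]


def generalize(pattern, slots, context=1):
    """Stage 2a (sketch): trim a window to the span between its first and
    last placeholder, keeping `context` tokens on each side."""
    hits = [i for i, tok in enumerate(pattern) if tok in slots]
    lo = max(0, hits[0] - context)
    hi = min(len(pattern), hits[-1] + context + 1)
    return pattern[lo:hi]


def annotate(rows, texts, window=6, min_support=2):
    """Stage 2b (sketch): count each generalized pattern once per row and
    keep only those supported by at least `min_support` rows. Duplicates
    collapse because patterns are collected in a set per row."""
    support = Counter()
    for row, text in zip(rows, texts):
        slots = {f"<C{i}>" for i in range(len(row))}
        per_row = {generalize(p, slots) for p in candidate_patterns(row, text, window)}
        support.update(per_row)
    kept = [p for p, n in support.most_common() if n >= min_support]
    return [" ".join(p) for p in kept]


if __name__ == "__main__":
    rows = [("Paris", "France"), ("Rome", "Italy")]
    texts = [
        "Paris is the capital of France and a major city",
        "Rome is the capital of Italy",
    ]
    print(annotate(rows, texts))  # ['<C0> is the capital of <C1>']
```

In this toy run, the pattern linking both columns is found in both rows and survives, while the single-column fragment `of <C1> and` appears for only one row and is dropped. A realistic system would also need to handle multi-token cell values, punctuation, and noisy mentions, which this sketch ignores.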

  • Subjects / Keywords
  • Graduation date
    Spring 2016
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
  • Supervisor / co-supervisor and their department(s)
  • Examining committee members and their departments
    • Goebel, Randy (Computing Science)
    • Rafiei, Davood (Computing Science)
    • Barbosa, Denilson (Computing Science)