Analyzing and Extracting Lists on the Web

  • Author / Creator
    Esteki, Afsaneh
  • The amount of information available on the Web is growing rapidly, and extracting useful and relevant data from this very large source has become an interesting research challenge. Among the various types of useful information that can be extracted, lists are particularly valuable because they provide groupings of related items. Such groupings are often interpretable and may present data in a more structured and condensed format that can be fed to other applications. In this thesis we explore some of the properties of lists embedded in web pages. Based on these properties, we propose a technique for classifying web pages into two categories: those that contain lists and those that do not. Our results show that, contrary to some previous work, not all list-specific HTML tags are useful for identifying list-containing web pages. We also study the related problem of locating lists within a page. We cast the problem of detecting the boundaries of a list as a classification task and build a classifier using relevant page features. Because the classifier produces a sequence of labels for each page, we examine some of the properties of this sequence and show how the accuracy of the detection can be further improved by rejecting sequences that are less likely to indicate a list.

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Supervisor / co-supervisor and their department(s)
    • Rafiei, Davood (Computing Science)
  • Examining committee members and their departments
    • Reformat, Marek (Electrical and Computer Engineering)
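
The two ideas in the abstract (tag-based page features, and filtering the label sequence a boundary classifier produces) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tag set `LIST_TAGS`, the helper names `tag_features` and `candidate_lists`, and the minimum-run-length rejection rule are all assumptions made for illustration, since the abstract does not state which features or rejection criteria were actually used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag set; the abstract does not say which tags the thesis uses.
LIST_TAGS = {"ul", "ol", "li", "dl", "dt", "dd", "table", "tr", "td"}


class TagCounter(HTMLParser):
    """Counts opening tags that belong to LIST_TAGS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in LIST_TAGS:
            self.counts[tag] += 1


def tag_features(html):
    """Return a dict of list-related tag counts, usable as page features."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


def candidate_lists(labels, min_items=3):
    """Return (start, end) pairs, end exclusive, for runs of 1-labels.

    Runs shorter than min_items are rejected. This threshold is an
    assumed stand-in for the sequence-rejection step in the abstract.
    """
    runs, start = [], None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            if i - start >= min_items:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(labels) - start >= min_items:
        runs.append((start, len(labels)))
    return runs
```

In this sketch, the tag counts would feed a page-level classifier, and `candidate_lists` would post-process per-element labels (1 for "inside a list", 0 otherwise) into list boundaries while discarding short runs that are more likely to be noise.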