Analyzing And Extracting Lists On The Web

Esteki, Afsaneh

doi:10.7939/R3PV6BG1V

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

The amount of information available on the Web is growing rapidly, and extracting more useful and relevant data from this tremendously large source has become an interesting research challenge. Among the various types of useful information that can be extracted, lists are particularly valuable because they provide groupings of related items. Such groupings are often interpretable and may present data in a more structured and condensed format that can be fed to other applications.
In this thesis we explore some of the properties of lists embedded in web pages. Based on these properties, we propose a technique for classifying web pages into two categories: those containing lists, and the rest. Our results show that, contrary to some previous work, not all list-specific HTML tags are useful for identifying list-containing web pages. We also study the related problem of locating lists in a page. We cast the problem of detecting the boundaries of a list as a classification task and build a classifier using relevant page features. As the classifier produces a sequence of labels for each page, we examine some of the properties of this sequence and show how the accuracy of the detection can be further improved by rejecting some of the sequences that are less likely to indicate a list.
Subjects / Keywords
- Lists
- Information Extraction
Graduation date

Fall 2013
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/R3PV6BG1V
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Rafiei, Davood (Computing Science)
Examining committee members and their departments
- Reformat, Marek (Electrical and Computer Engineering)