Permanent link (DOI): https://doi.org/10.7939/R3PV6BG1V
This file is in the following communities:
Graduate Studies and Research, Faculty of
This file is in the following collections:
Theses and Dissertations
Analyzing And Extracting Lists On The Web (Open Access)
- Other title
- Type of item
- Degree grantor
University of Alberta
- Author or creator
- Supervisor and department
Rafiei, Davood (Computing Science)
- Examining committee member and department
Reformat, Marek (Electrical and Computer Engineering)
Department of Computing Science
- Date accepted
- Graduation date
Master of Science
- Degree level
The amount of information available on the Web is rapidly growing, and the need for extracting more useful and relevant data from this tremendously large source has become an interesting research challenge. Among various types of useful information that can be extracted, lists in particular are highly valuable as they provide groupings of related items. Such groupings are often interpretable and may present data in a more structured and condensed format that can be fed to other applications.
In this thesis, we explore some of the properties of lists embedded in web pages. Based on these properties, we propose a technique for classifying web pages into two categories: those containing lists, and the rest. Our results show that, in contrast to some previous work, not all list-specific HTML tags are useful for identifying list-containing web pages. We also study the related problem of locating lists in a page. We cast the problem of detecting the boundaries of a list as a classification task and build a classifier using relevant page features. As the classifier produces a sequence of labels for each page, we examine some of the properties of this sequence and show how the accuracy of the detection can be further improved by rejecting some of the sequences that are less likely to indicate a list.
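As a rough illustration of the kind of page feature such a classifier might use, the sketch below counts occurrences of list-related HTML tags in a page. It is not the method of the thesis: the tag set `LIST_TAGS` and the helper `list_tag_features` are hypothetical names chosen for this example, and the abstract does not say which tags proved useful.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: an illustrative set of list-related tags. The abstract does not
# enumerate the tags the thesis actually evaluates.
LIST_TAGS = ("ul", "ol", "li", "dl", "dt", "dd", "table", "tr", "td")


class TagCounter(HTMLParser):
    """Counts every start tag seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


def list_tag_features(html):
    """Return one count per tag in LIST_TAGS, in that order (hypothetical helper)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in LIST_TAGS]
```

A feature vector of this form could be passed to any standard classifier. Comparing the classifier's accuracy with and without each tag count is one way to check whether a given tag actually helps identify list-containing pages.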
- Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 931565
Last modified: 2015:10:12 16:41:48-06:00
Filename: Esteki_Afsaneh_Fall 2013.pdf
Original checksum: f56e23b79f8c1e3ef3b14edde6bf33b2
Well formed: false
Status message: No document catalog dictionary offset=0