ERA

Permanent link (DOI): https://doi.org/10.7939/R3PV6BG1V

Communities

This file is in the following communities:

Graduate Studies and Research, Faculty of

Collections

This file is in the following collections:

Theses and Dissertations

Analyzing And Extracting Lists On The Web (Open Access)

Descriptions

Other title
Subject/Keyword
Lists
Information Extraction
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Esteki, Afsaneh
Supervisor and department
Rafiei, Davood (Computing Science)
Examining committee member and department
Reformat, Marek (Electrical and Computer Engineering)
Department
Department of Computing Science
Date accepted
2013-09-28T14:25:30Z
Graduation date
2013-11
Degree
Master of Science
Degree level
Master's
Abstract
The amount of information available on the Web is rapidly growing, and the need to extract more useful and relevant data from this tremendously large source has become an interesting research challenge. Among the various types of useful information that can be extracted, lists in particular are highly valuable as they provide groupings of related items. Such groupings are often interpretable and may present data in a more structured and condensed format that can be fed to other applications. In this thesis we explore some of the properties of lists embedded in web pages. Based on these properties, we propose a technique for classifying web pages into two categories: those containing lists, and the rest. Our results show that, contrary to some previous work, not all list-specific HTML tags are useful for identifying list-containing web pages. We also study the related problem of locating lists in a page. We cast the problem of detecting the boundaries of a list as a classification task and build a classifier using relevant page features. As the classifier produces a sequence of labels for each page, we examine some of the properties of this sequence and show how the accuracy of the detection can be further improved by rejecting some of the sequences that are less likely to indicate a list.
Language
English
DOI
doi:10.7939/R3PV6BG1V
Rights
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

File Details

Date Uploaded
Date Modified
2014-05-02T17:45:54.241+00:00
Audit Status
Audits have not yet been run on this file.
Characterization
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 931565
Last modified: 2015:10:12 16:41:48-06:00
Filename: Esteki_Afsaneh_Fall 2013.pdf
Original checksum: f56e23b79f8c1e3ef3b14edde6bf33b2
Well formed: false
Valid: false
Status message: No document catalog dictionary offset=0