Indexing and Querying Natural Language Text

Chubak, Pirooz

doi:doi:10.7939/R3GQ6Q

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

422 views
572 downloads

Indexing and Querying Natural Language Text

Author / Creator

Chubak, Pirooz
Natural language text is a prominent source of representing and communicating information
and knowledge. It is often desirable to search in granularities of text that are smaller than
a document or to query the syntactic roles and relationships within syntactically annotated
text sentences, often represented by parse trees. In this thesis, we study the problems of
efficiently indexing and querying natural language text in the scenarios where (1) text is
modelled as flat sequences of words and (2) text is modelled as collections of syntactically
annotated trees.
In the first scenario, we study some of the index structures that are capable of answering
the class of queries referred to here as wild card queries and perform an analysis of their
performance. Our experimental results on a large class of queries from different sources
(including query logs and parse trees) and with various datasets reveal some of the performance
barriers of these indexes. We present Word Permuterm Index (WPI) and show that
it supports a wide range of wild card queries, is quick to construct and is highly scalable.
Our experimental results comparing WPI to alternative methods on a wide range of wild
card queries show a few orders of magnitude performance improvement for WPI while the
memory usage is kept the same for all compared systems.
In the second scenario, we study index structures and access methods that improve the
performance of querying over syntactically parsed sentences. We propose a novel indexing
scheme over unique subtrees as index keys. We also introduce the root-split coding scheme
that concisely stores structural subtree information, making it possible to perform exact
axes matching over subtrees. We theoretically study the properties of our coding and the
limitations it imposes over query processing. Our extensive set of experiments show that
root-split coding reduces the index size of a baseline index which stores the interval codes
of all nodes by a factor of up to 5 (i.e. 80% reduction) and speeds up querying runtime by
more than 6 times on average, when subtrees of sizes 1, . . . , 5 are indexed.
Subjects / Keywords
Graduation date

Spring 2012
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3GQ6Q
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Rafiei, Davood (Computing Science)
Examining committee members and their departments
- Ng, Raymond T. (Computer Science, UBC)
- Barbosa, Denilson (Computing Science)
- Kondrak, Grzegorz (Computing Science)
- Newman, John (Linguistics)