Download the full-sized PDF of Indexing and Querying Natural Language TextDownload the full-sized PDF



Permanent link (DOI):


Export to: EndNote  |  Zotero  |  Mendeley


This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Indexing and Querying Natural Language Text Open Access


Other title
Index Structures
Subtree Index
Natural Language Text
Tree Pattern Matching
Type of item
Degree grantor
University of Alberta
Author or creator
Chubak, Pirooz
Supervisor and department
Rafiei, Davood (Computing Science)
Examining committee member and department
Barbosa, Denilson (Computing Science)
Kondrak, Grzegorz (Computing Science)
Newman, John (Linguistics)
Ng, Raymond T. (Computer Science, UBC)
Department of Computing Science

Date accepted
Graduation date
Doctor of Philosophy
Degree level
Natural language text is a prominent source of representing and communicating information and knowledge. It is often desirable to search in granularities of text that are smaller than a document or to query the syntactic roles and relationships within syntactically annotated text sentences, often represented by parse trees. In this thesis, we study the problems of efficiently indexing and querying natural language text in the scenarios where (1) text is modelled as flat sequences of words and (2) text is modelled as collections of syntactically annotated trees. In the first scenario, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We present Word Permuterm Index (WPI) and show that it supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental results comparing WPI to alternative methods on a wide range of wild card queries show a few orders of magnitude performance improvement for WPI while the memory usage is kept the same for all compared systems. In the second scenario, we study index structures and access methods that improve the performance of querying over syntactically parsed sentences. We propose a novel indexing scheme over unique subtrees as index keys. We also introduce the root-split coding scheme that concisely stores structural subtree information, making it possible to perform exact axes matching over subtrees. We theoretically study the properties of our coding and the limitations it imposes over query processing. Our extensive set of experiments show that root-split coding reduces the index size of a baseline index which stores the interval codes of all nodes by a factor of up to 5 (i.e. 80% reduction) and speeds up querying runtime by more than 6 times on average, when subtrees of sizes 1, . . . , 5 are indexed.
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication
Pirooz Chubak and Davood Rafiei, "Index Structures for Efficiently Searching Natural Language Text", Proceedings of the 19th ACM international conference on Information and knowledge management, 2010

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 840212
Last modified: 2015:10:12 15:18:29-06:00
Filename: Chubak_Pirooz_Spring 2012.pdf
Original checksum: 5a167f5c17f4f03492f0a3b755ce02cd
Well formed: true
Valid: true
File title: thesis.dvi
Page count: 120
Activity of users you follow
User Activity Date