Extracting Information Networks from Text

  • Author / Creator
    de Sa Mesquita, Filipe
  • Abstract
    This work is concerned with the problem of extracting structured information networks from a text corpus. The nodes of the network are recognizable entities, typically people, locations, or organizations, while the edges denote relations among such entities. We use state-of-the-art natural language processing tools to identify the entities and focus on extracting instances of relations. The first relation extraction approaches were supervised and relation-specific, producing new instances of relations known a priori. While effective, this paradigm is not applicable when the relations are not known a priori or when the number of relations is high. Recently, open relation extraction (ORE) techniques were developed to extract instances of arbitrary relations while requiring fewer training examples. Because of their appeal to applications that rely on large-scale relation extraction, a major requirement for ORE methods is low computational cost. Several ORE approaches have been proposed recently, covering a wide range of NLP machinery, from "shallow" (e.g., part-of-speech tagging) to "deep" (e.g., semantic role labeling -- SRL), raising the question of the trade-off between NLP depth (and its associated computational cost) and effectiveness.

    We study this trade-off in depth and make the following contributions. First, we introduce a fair and objective benchmark for this task and report on an experimental comparison of 11 ORE methods, shedding light on the state of the art. Next, we propose rule-based methods that achieve higher effectiveness at lower computational cost than the previous best approaches. We also address the problem of extracting nested relations (i.e., relations that accept relation instances as arguments) and n-ary relations (i.e., relations with n > 2 arguments); previously, all methods for extracting these types of relations were based on SRL, which can be up to 1000 times slower than methods based on shallow NLP. Finally, we describe a solution that starts with shallow extraction methods and decides, on the fly and on a per-sentence basis, whether to deploy deeper extraction methods based on dependency parsing and SRL. Our solution prioritizes extra computational resources for sentences describing relation instances that are likely to be extracted by deeper methods. We show experimentally that this solution achieves much higher effectiveness at a fraction of the cost of SRL.
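    To make the "shallow" end of the spectrum concrete, the sketch below shows rule-based extraction of (argument, relation, argument) triples from a part-of-speech-tagged sentence, in the general spirit of POS-pattern ORE systems. The pattern (noun chunk, verb phrase with an optional trailing preposition, noun chunk) and the example sentences are illustrative assumptions, not the thesis's actual rules.

    ```python
    def extract_triples(tagged):
        """Yield (subject, relation, object) triples from a POS-tagged sentence.

        Pattern: noun chunk + verb(s), optionally ending in a preposition,
        + noun chunk. Tags follow the Penn Treebank convention.
        """
        nouns = {"NN", "NNS", "NNP", "NNPS"}
        verbs = {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG"}
        i, n = 0, len(tagged)
        while i < n:
            if tagged[i][1] in nouns:
                # Extend the subject noun chunk as far as possible.
                subj_end = i
                while subj_end + 1 < n and tagged[subj_end + 1][1] in nouns:
                    subj_end += 1
                # Match the relation phrase: one or more verbs...
                rel_start = j = subj_end + 1
                while j < n and tagged[j][1] in verbs:
                    j += 1
                # ...optionally followed by a single preposition (IN).
                if j > rel_start and j < n and tagged[j][1] == "IN":
                    j += 1
                # The object must be another noun chunk right after the relation.
                if j > rel_start and j < n and tagged[j][1] in nouns:
                    obj_end = j
                    while obj_end + 1 < n and tagged[obj_end + 1][1] in nouns:
                        obj_end += 1
                    yield (" ".join(w for w, _ in tagged[i:subj_end + 1]),
                           " ".join(w for w, _ in tagged[rel_start:j]),
                           " ".join(w for w, _ in tagged[j:obj_end + 1]))
                    i = obj_end + 1
                    continue
            i += 1

    sentence = [("Edison", "NNP"), ("founded", "VBD"),
                ("General", "NNP"), ("Electric", "NNP")]
    print(list(extract_triples(sentence)))
    # → [('Edison', 'founded', 'General Electric')]
    ```

    A single linear scan like this needs only a POS tagger, which is why such methods run orders of magnitude faster than SRL-based pipelines; the cost is that the rigid pattern misses relations expressed by longer-range syntax, which is exactly the gap the deeper methods discussed above are meant to close.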

  • Subjects / Keywords
  • Graduation date
    Spring 2015
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/R37T09
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
    English
  • Institution
    University of Alberta
  • Degree level
    Doctoral
  • Department
  • Supervisor / co-supervisor and their department(s)
  • Examining committee members and their departments
    • Rafiei, Davood (Computing Science)
    • Goebel, Randolph (Computing Science)
    • Carenini, Giuseppe (Computer Science, UBC)
    • Reformat, Marek (Electrical and Computer Engineering)