Extracting the Lowest-Frequency Words: Pitfalls and Possibilities

Weeber, M.; Baayen, R.H.; Vos, R.

doi:10.7939/R3QN5ZN98


Communities and Collections

Linguistics, Department of / Research Publications (Linguistics)


Abstract
In a medical information extraction system, we use common word association techniques to extract side-effect-related terms. Many of these terms have a frequency of less than five. Standard word-association-based applications disregard the lowest-frequency words, and hence disregard useful information. We therefore devised an extraction system for the full word frequency range. This system computes the significance of association by the log-likelihood ratio and Fisher’s exact test. The output of the system shows a recurrent, corpus-independent pattern in both recall and the number of significant words. We will explain these patterns by the statistical behavior of the lowest-frequency words. We used Dutch verb-particle combinations as a second and independent collocation extraction application to illustrate the generality of the observed phenomena. We will conclude that a) word-association-based extraction systems can be enhanced by also considering the lowest-frequency words, b) significance levels should not be fixed but adjusted for the optimal window size, c) hapax legomena, words occurring only once, should be disregarded a priori in the statistical analysis, and d) the distribution of the targets to extract should be considered in combination with the extraction method.
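The two association measures named in the abstract can be illustrated on a 2x2 contingency table of co-occurrence counts. The following is a minimal sketch, not the authors' implementation: the function names, the table layout (k11 = target word inside the window, k12 = target word outside it, k21 and k22 = all other words inside and outside), and the example counts are assumptions made here for illustration only.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table; cells with a zero count contribute 0."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def fisher_exact_right_tail(k11, k12, k21, k22):
    """One-sided Fisher p-value: P(X >= k11) with all margins held fixed."""
    row1 = k11 + k12
    col1 = k11 + k21
    total = k11 + k12 + k21 + k22
    # Hypergeometric probabilities, summed over tables at least as extreme.
    numerator = sum(
        math.comb(row1, x) * math.comb(total - row1, col1 - x)
        for x in range(k11, min(row1, col1) + 1)
    )
    return numerator / math.comb(total, col1)


# Hypothetical counts: a word seen 3 times in total, always inside the window.
g2 = log_likelihood_ratio(3, 0, 47, 950)
p = fisher_exact_right_tail(3, 0, 47, 950)
print(f"G^2 = {g2:.3f}, Fisher one-sided p = {p:.6f}")
```

With counts this small, the exact test sums only a handful of hypergeometric terms, while G^2 relies on a large-sample approximation; comparing the two on such tables is one way to explore the low-frequency behavior the abstract refers to.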
Date created

2000
Subjects / Keywords
- Contingency-tables
- Fexact
Type of Item

Article (Published)
DOI

https://doi.org/10.7939/R3QN5ZN98

Language
- English
Citation for previous publication
- Weeber, M., Vos, R., & Baayen, R. H. (2000). Extracting the Lowest-Frequency Words: Pitfalls and Possibilities. Computational Linguistics, 26(3), 301-317.