
Evaluation of High-Dimensional Word Embeddings using Cluster and Semantic Similarity Analysis

  • Author / Creator
    Atakishiyev, Shahin
  • Vector representations of words, also known as distributed representations of words or word embeddings, have recently attracted a lot of attention in computational semantics and natural language processing. They have been used in a variety of tasks such as named entity recognition, part-of-speech tagging, spoken language understanding, and several word similarity tasks. The numerical representations of words are mainly acquired either through co-occurrence of words (count-based) or neural networks (predictive). These two techniques represent words with different numerical distributions, and therefore several studies have been conducted to estimate the quality of word embeddings for various downstream tasks. Our research sheds light on the evaluation of predictive and count-based word vectors using cluster and semantic similarity analysis. In particular, we have analyzed two crisp clustering algorithms (k-means and k-medoids) and two fuzzy clustering algorithms (fuzzy C-means and fuzzy Gustafson-Kessel) on word vectors of several dimensionalities, from the perspective of the quality of word clustering. Moreover, we also measure the semantic similarity of word vectors with respect to a gold standard dataset to observe how well word vectors of different dimensionalities agree with human judgments. The empirical results show that the fuzzy C-means algorithm with adjusted parameter settings groups words properly up to around one hundred dimensions but fails in higher dimensions. Fuzzy Gustafson-Kessel clustering, moreover, proved completely unstable even at around fifty dimensions. Crisp clustering methods, on the other hand, show impressive performance, and surprisingly their performance even improves as the dimensionality increases. Furthermore, the results indicated that higher-dimensional vectors represent words better in word similarity tasks based on human judgment.
Finally, we conclude that no single word embedding method can be said to uniformly outperform another across all tasks, and different word vector architectures may produce different experimental results depending on the specific problem.
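The semantic similarity evaluation described in the abstract — scoring word pairs by vector similarity and comparing the ranking against gold-standard human judgments — can be sketched as follows. The toy word vectors, word pairs, and human scores below are invented for illustration and are not data from the thesis:

```python
import math

# Toy low-dimensional word vectors (invented for illustration; the thesis
# evaluates real count-based and predictive embeddings of higher dimension).
vectors = {
    "car":   [0.9, 0.1, 0.0],
    "auto":  [0.85, 0.15, 0.05],
    "fruit": [0.1, 0.9, 0.2],
    "apple": [0.2, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for brevity)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical gold-standard human similarity judgments for word pairs.
pairs = [("car", "auto"), ("fruit", "apple"), ("car", "fruit")]
human = [9.0, 8.5, 1.5]

# Score each pair with the model, then correlate rankings with the humans'.
model = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho = spearman(model, human)
print(rho)  # → 1.0 on this toy data: model and human rankings agree exactly
```

A higher Spearman correlation means the embedding's similarity ranking agrees more closely with human judgment; in the thesis this comparison is repeated across embeddings of different dimensionalities.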

  • Subjects / Keywords
  • Graduation date
    Spring 2018
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/R3C824W03
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.