Discrete Flag Xor - Memory and Time Efficient Vector Search

  • Author / Creator
    Tobias Renwick
  • Vector representations of text are heavily used in state of the art NLP tasks, however they require significant computing resources to build and use. In this work, we discuss and evaluate DFx, a system for representing vectors in a compressed form without compromising the similarity between those vectors, resulting in faster comparisons with less memory. Evaluation on a corpus of 1,000,000 tweets indicates that for the top 10 closest neighbors (KNN), dFX overlaps with cosine similarity for 95.4% of results while using as little as 12.5% of the original memory and completing the comparisons in 17.6% or less of the original time (assumes that the full size vector array fits in main memory). The KNN similarity rate decreases to only 90.2% when K = 1000. We further evaluate DFx on STS (Semantic Textual Similarity) benchmarks (STS-12 through 16) and demonstrate minimal performance loss (and occasional gains) due to discretization. We also explore the notion of paraphrases built through a corpus wide similarity search. The generated paraphrases are evaluated on ROUGE, and BLEU for corpus similarity and for semantic similarity by using BERT-Score.
    The generated phrases achieve scoring as follows: BLEU 60.74%, ROUGE-1 67.91%, ROUGE-2 41.21% and BERT-Score F1 result of 77.92% on the WikiAnswers paraphrase corpus. We believe these results indicate that DFx can be used in practical applications and introduce computational gains without significantly compromising accuracy.

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.