Computational Decipherment of Unknown Scripts

  • Author / Creator
    Hauer, Bradley
  • Algorithmic decipherment is a prime example of a truly unsupervised problem. This thesis presents several algorithms developed for the purpose of decrypting unknown alphabetic scripts representing unknown languages. We assume that symbols in scripts which contain no more than a few dozen unique characters roughly correspond to the phonemes of a language, and model such scripts as monoalphabetic substitution ciphers. We further allow that an unknown transposition scheme could have been applied to the enciphered text, resulting in arbitrary scrambling of letters within words (anagramming). We also consider the possibility that the underlying script is an abjad, in which only consonants are explicitly represented. Our decryption system is composed of three steps. The first step in the decipherment process is the identification of the encrypted language. We propose three methods for determining the source language of a document enciphered with a monoalphabetic substitution cipher. The best method achieves 97% accuracy on 380 languages. The second step is to map each symbol of the ciphertext to the corresponding letter in the identified language. We propose a novel approach to deciphering short monoalphabetic substitution ciphers which combines both character-level and word-level language models. Our method achieves a significant improvement over the state of the art on a benchmark suite of short ciphers. The third step is to decode the resulting anagrams into readable text, which may involve the recovery of unwritten vowels. Our approach obtains an average decryption word accuracy of 93% on a set of 50 ciphertexts in 5 languages. Finally, we apply our new techniques to the Voynich manuscript, a centuries-old document written in an unknown script, which has resisted decipherment despite decades of study.
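
    The script model described in the abstract (a monoalphabetic substitution cipher, optionally combined with anagramming within words and with abjad-style omission of vowels) can be illustrated with a minimal Python sketch. This is not code from the thesis. The function names (make_key, encipher, decipher), the restriction to the 26 lowercase Latin letters, and the treatment of "aeiou" as the vowel set are assumptions made only for illustration. The sketch shows how such ciphertexts arise and how they are inverted when the key is known. It does not attempt the unsupervised decipherment that the thesis addresses.

```python
import random
import string

# Assumption for illustration: a Latin-alphabet language with these vowels.
VOWELS = set("aeiou")


def make_key(rng):
    """Return a random one-to-one mapping from plaintext letters to cipher symbols."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encipher(plaintext, key, rng, abjad=False, anagram=False):
    """Encipher text word by word under the assumed script model.

    abjad=True drops vowels before substitution (only consonants are written).
    anagram=True scrambles the symbols inside each word after substitution.
    """
    out = []
    for word in plaintext.lower().split():
        letters = [c for c in word if c in key]  # ignore punctuation and digits
        if abjad:
            letters = [c for c in letters if c not in VOWELS]
        symbols = [key[c] for c in letters]
        if anagram:
            rng.shuffle(symbols)
        if symbols:  # skip words that became empty (e.g. "a" in abjad mode)
            out.append("".join(symbols))
    return " ".join(out)


def decipher(ciphertext, key):
    """Invert the substitution with a known key; does not undo anagramming or restore vowels."""
    inverse = {symbol: letter for letter, symbol in key.items()}
    return " ".join(
        "".join(inverse[c] for c in word) for word in ciphertext.split()
    )
```

    With a known key, decipher recovers plain substitution exactly. With abjad=True, "hello world" comes back as "hll wrld", and with anagram=True each word comes back as a scrambling of its original letters. Those two residual problems correspond to the third step described in the abstract, and in the real setting the key itself is unknown as well.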

  • Subjects / Keywords
  • Graduation date
    Spring 2016
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
  • Supervisor / co-supervisor and their department(s)
  • Examining committee members and their departments
    • Kondrak, Greg (Computing Science)
    • Hayward, Ryan (Computing Science)
    • Stewart, Selina (History and Classics)