Permanent link (DOI): https://doi.org/10.7939/R37H1DX8H

Communities

This file is in the following communities:

Graduate Studies and Research, Faculty of

Collections

This file is in the following collections:

Theses and Dissertations

Computational Decipherment of Unknown Scripts (Open Access)

Descriptions

Other title
Subject/Keyword
Artificial Intelligence
Natural Language Processing
Decipherment
Decoding
Decryption
Anagrams
Alphagrams
Voynich
Voynich Manuscript
Hebrew
Beam Search
Monte Carlo Tree Search
Transliteration
Encryption
Language
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Hauer, Bradley
Supervisor and department
Kondrak, Greg (Computing Science)
Examining committee member and department
Kondrak, Greg (Computing Science)
Hayward, Ryan (Computing Science)
Stewart, Selina (History and Classics)
Department
Department of Computing Science
Specialization

Date accepted
2016-04-26T11:20:22Z
Graduation date
2016-06
Degree
Master of Science
Degree level
Master's
Abstract
Algorithmic decipherment is a prime example of a truly unsupervised problem. This thesis presents several algorithms developed for the purpose of decrypting unknown alphabetic scripts representing unknown languages. We assume that symbols in scripts which contain no more than a few dozen unique characters roughly correspond to the phonemes of a language, and model such scripts as monoalphabetic substitution ciphers. We further allow that an unknown transposition scheme could have been applied to the enciphered text, resulting in arbitrary scrambling of letters within words (anagramming). We also consider the possibility that the underlying script is an abjad, in which only consonants are explicitly represented. Our decryption system is composed of three steps. The first step in the decipherment process is the identification of the encrypted language. We propose three methods for determining the source language of a document enciphered with a monoalphabetic substitution cipher. The best method achieves 97% accuracy on 380 languages. The second step is to map each symbol of the ciphertext to the corresponding letter in the identified language. We propose a novel approach to deciphering short monoalphabetic substitution ciphers which combines both character-level and word-level language models. Our method achieves a significant improvement over the state of the art on a benchmark suite of short ciphers. The third step is to decode the resulting anagrams into readable text, which may involve the recovery of unwritten vowels. Our approach obtains an average decryption word accuracy of 93% on a set of 50 ciphertexts in 5 languages. Finally, we apply our new techniques to the Voynich manuscript, a centuries-old document written in an unknown script, which has resisted decipherment despite decades of study.
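The cipher model described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's code: the function names, the random key, the five-letter vowel set, and the use of alphabetical sorting (an alphagram) as one particular within-word scrambling are all assumptions made for illustration. It only shows how a plaintext might be turned into the kind of ciphertext the thesis sets out to decode; the decipherment algorithms themselves are not reproduced here.

```python
import random
import string

# Assumption: a simple Latin vowel set stands in for "unwritten vowels".
VOWELS = set("aeiou")

def make_key(seed=0):
    """Build a random one-to-one letter mapping (a monoalphabetic key)."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))

def drop_vowels(text):
    """Abjad-style writing: keep only the consonants (and spaces)."""
    return "".join(ch for ch in text if ch not in VOWELS)

def substitute(text, key):
    """Replace each letter by its image under the key; keep other characters."""
    return "".join(key.get(ch, ch) for ch in text)

def alphagram(word):
    """One possible anagramming: sort the letters of a word alphabetically."""
    return "".join(sorted(word))

def encipher(plaintext, key, abjad=False):
    """Optionally drop vowels, then substitute, then scramble within words."""
    text = plaintext.lower()
    if abjad:
        text = drop_vowels(text)
    substituted = substitute(text, key)
    return " ".join(alphagram(word) for word in substituted.split(" "))

if __name__ == "__main__":
    key = make_key(seed=1)
    print(encipher("unknown scripts", key))
    print(encipher("unknown scripts", key, abjad=True))
```

Word boundaries and word lengths survive this transformation, while letter order within each word is lost, which is why the third step in the abstract has to recover readable words from anagrams.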
Language
English
DOI
doi:10.7939/R37H1DX8H
Rights
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Bradley Hauer, Ryan Hayward, Grzegorz Kondrak. 2014. Solving Substitution Ciphers with Combined Language Models. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2314–2325, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
Bradley Hauer, Grzegorz Kondrak. 2016. Decoding Anagrammed Texts Written in an Unknown Language and Script. Transactions of the Association for Computational Linguistics, 4, pages 75–86.

File Details

Date Uploaded
Date Modified
2016-04-26T17:20:31.082+00:00
Characterization
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 963754
Last modified: 2016:11:16 14:41:32-07:00
Filename: Hauer_Bradley_M_201604_MSc.pdf
Original checksum: c1ebed2320e0c13ab85b998aa5efa6f8
Well formed: true
Valid: true
File title: Untitled
Page count: 69