A comparison of input types to a deep neural network-based forced aligner

Matthew C. Kelley; Benjamin V. Tucker

doi:10.7939/R34T6FJ44

The present paper investigates the effect of different inputs on the accuracy of a forced alignment tool built using deep neural networks. Raw audio samples and Mel-frequency cepstral coefficients (MFCCs) were compared as network inputs. A set of experiments was performed using the TIMIT speech corpus as training data, together with its accompanying test data set. The networks consisted of a series of convolutional layers followed by a series of bidirectional long short-term memory (LSTM) layers. The convolutional layers were trained first to act as feature detectors, after which their weights were frozen. The LSTM layers were then trained to learn the temporal relations in the data. The current results indicate that networks using raw audio outperform both the networks using MFCCs and an off-the-shelf forced aligner. Possible explanations for why the raw audio networks perform better are discussed. We then lay out potential ways to improve the results of the networks and conclude with a comparison of human cognition to network architecture.
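
The two-stage procedure described in the abstract (convolutional layers trained first and then frozen, followed by training of the bidirectional LSTM layers) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The class name `ConvBiLSTMAligner`, the layer counts and sizes, the number of output classes, the per-frame cross-entropy loss, and the random stand-in data are all assumptions made for illustration. The sketch shows only the freezing step and one update of the second stage; how the convolutional layers were trained in the first stage is not specified in the abstract and is omitted here.

```python
import torch
from torch import nn


class ConvBiLSTMAligner(nn.Module):
    """Hypothetical sketch: convolutional feature detectors followed by
    bidirectional LSTM layers. All sizes are arbitrary placeholders."""

    def __init__(self, in_channels=1, conv_channels=32, hidden=64, n_classes=10):
        super().__init__()
        # in_channels=1 stands in for raw audio; an MFCC input would instead
        # use one channel per coefficient (an assumption, not from the paper).
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        feats = self.conv(x)           # (batch, conv_channels, time)
        feats = feats.transpose(1, 2)  # (batch, time, conv_channels)
        seq, _ = self.lstm(feats)      # (batch, time, 2 * hidden)
        return self.out(seq)           # (batch, time, n_classes)


def freeze_conv(model):
    """Freeze the convolutional weights so later training leaves them fixed."""
    for p in model.conv.parameters():
        p.requires_grad_(False)


def stage2_step(model, optimizer, x, y):
    """One update of the still-trainable layers with an assumed per-frame
    cross-entropy loss."""
    logits = model(x)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = ConvBiLSTMAligner()
freeze_conv(model)  # assumes the convolutional layers were already trained

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

x = torch.randn(4, 1, 200)          # random stand-in for a batch of audio
y = torch.randint(0, 10, (4, 200))  # random stand-in for per-frame labels

conv_before = [p.clone() for p in model.conv.parameters()]
lstm_before = [p.clone() for p in model.lstm.parameters()]
loss = stage2_step(model, optimizer, x, y)
```

After the step, the copies in `conv_before` still match the convolutional weights, while the LSTM weights differ from `lstm_before`, which is the behaviour the freezing step is meant to guarantee.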

Preprint of paper number 1115 (pages 1205-1209) at Interspeech 2018. If citing this paper, please cite the Interspeech proceedings version, DOI: 10.21437/Interspeech.2018-1115
Date created

2018-01-01
Type of Item

Conference/Workshop Presentation
DOI

https://doi.org/10.7939/R34T6FJ44
License

Attribution-ShareAlike 4.0 International

Language
- English