A comparison of input types to a deep neural network-based forced aligner

  • Author(s) / Creator(s)
  • Poster for the paper "A comparison of input types to a deep neural network-based forced aligner," presented at Interspeech 2018. A typo in the alignment matrix (O[2,2] referenced O[1,2] instead of O[1,1]) was corrected on June 4, 2019.

    PAPER ABSTRACT: The present paper investigates the effect of different inputs on the accuracy of a forced alignment tool built using deep neural networks. Both raw audio samples and Mel-frequency cepstral coefficients were compared as network inputs. A set of experiments was performed using the TIMIT speech corpus as training data and its accompanying test data set. The networks consisted of a series of convolutional layers followed by a series of bidirectional long short-term memory (LSTM) layers. The convolutional layers were trained first to act as feature detectors, after which their weights were frozen. Then, the LSTM layers were trained to learn the temporal relations in the data. The current results indicate that networks using raw audio perform better than those using Mel-frequency cepstral coefficients and an off-the-shelf forced aligner. Possible explanations for why the raw audio networks perform better are discussed. We then lay out potential ways to improve the results of the networks and conclude with a comparison of human cognition to network architecture.

    Preprint of paper number 1115 (pages 1205-1209) at Interspeech 2018. If citing this paper, please cite the Interspeech proceedings version, DOI: 10.21437/Interspeech.2018-1115. A hedged sketch of the architecture described in the abstract follows the record below.

  • Date created
    2018-01-01
  • Subjects / Keywords
  • Type of Item
    Conference/Workshop Poster
  • DOI
    https://doi.org/10.7939/R3Z31P44Z
  • License
    Attribution-ShareAlike 4.0 International
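
The abstract describes a two-phase design: convolutional layers trained first to act as feature detectors and then frozen, followed by bidirectional LSTM layers trained to learn temporal relations. The sketch below illustrates that structure for the raw-audio input condition. It is written in PyTorch as an assumption (the record names no framework), and all layer sizes, kernel widths, strides, and the 61-phone TIMIT label count are illustrative placeholders, not the authors' published hyperparameters.

    # Minimal sketch of the two-phase design from the abstract: convolutional
    # feature detectors over raw audio, frozen after their own training phase,
    # then bidirectional LSTMs and a framewise phone classifier.
    import torch
    import torch.nn as nn

    class RawAudioAligner(nn.Module):
        def __init__(self, n_phones: int = 61):  # 61 TIMIT phone labels (assumed)
            super().__init__()
            # Phase 1: convolutions over the raw waveform (batch, 1, samples).
            self.conv = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=160, stride=80),  # ~10 ms hop at 16 kHz
                nn.ReLU(),
                nn.Conv1d(64, 128, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            # Phase 2: bidirectional LSTMs learn temporal relations across frames.
            self.lstm = nn.LSTM(128, 128, num_layers=2,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * 128, n_phones)  # framewise phone scores

        def freeze_conv(self):
            # After the convolutional layers are trained as feature detectors,
            # freeze their weights so only the LSTM and output layers update.
            for p in self.conv.parameters():
                p.requires_grad = False

        def forward(self, waveform):
            feats = self.conv(waveform)       # (batch, channels, frames)
            feats = feats.transpose(1, 2)     # (batch, frames, channels)
            hidden, _ = self.lstm(feats)
            return self.out(hidden)           # (batch, frames, n_phones)

    model = RawAudioAligner()
    model.freeze_conv()                       # set up the second training phase
    scores = model(torch.randn(2, 1, 16000))  # two 1-second clips at 16 kHz
    print(scores.shape)                       # torch.Size([2, 199, 61])

Framewise phone scores like these would then feed a forced-alignment step that matches the known label sequence to the frames; that step is outside this sketch.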