A comparison of input types to a deep neural network-based forced aligner

  • Author(s) / Creator(s)
  • The present paper investigates the effect of different inputs on the accuracy of a forced alignment tool built using deep neural networks. Raw audio samples and Mel-frequency cepstral coefficients were compared as network inputs. A set of experiments was performed using the TIMIT speech corpus as training data and its accompanying test data set for evaluation. The networks consisted of a series of convolutional layers followed by a series of bidirectional long short-term memory (LSTM) layers. The convolutional layers were trained first to act as feature detectors, after which their weights were frozen. The LSTM layers were then trained to learn the temporal relations in the data. The current results indicate that networks using raw audio perform better than both networks using Mel-frequency cepstral coefficients and an off-the-shelf forced aligner. Possible explanations for why the raw audio networks perform better are discussed. We then lay out potential ways to improve the results of the networks and conclude with a comparison of human cognition to network architecture.

    Preprint of paper number 1115 (pages 1205-1209) at Interspeech 2018. If citing this paper, please cite the Interspeech proceedings version, DOI: 10.21437/Interspeech.2018-1115

  • Date created
  • Subjects / Keywords
  • Type of Item
    Conference/Workshop Presentation
  • DOI
  • License
    Attribution-ShareAlike 4.0 International
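
The abstract compares aligners by accuracy, but this record does not state how accuracy was measured. The following is a minimal sketch, assuming one common way of scoring forced alignment: the fraction of predicted phone boundaries that fall within a fixed tolerance of the hand-labelled boundaries. The function name `boundary_accuracy`, the 20 ms tolerance, and the boundary times are all illustrative assumptions, not values taken from the paper.

```python
def boundary_accuracy(predicted, reference, tolerance=0.020):
    """Fraction of predicted boundaries within `tolerance` seconds of the reference.

    Assumes both lists hold one boundary time (in seconds) per phone, in the
    same order, since a forced aligner is given the phone sequence in advance.
    The metric and the default tolerance are assumptions for illustration.
    """
    if len(predicted) != len(reference):
        raise ValueError("expected one predicted boundary per reference boundary")
    if not reference:
        raise ValueError("no boundaries to score")
    # Count boundaries whose absolute timing error is within the tolerance.
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up boundary times in seconds, for illustration only.
reference = [0.10, 0.25, 0.40, 0.62]
predicted = [0.11, 0.24, 0.45, 0.62]
print(boundary_accuracy(predicted, reference))
```

Under this assumed metric, two aligners (for example, one fed raw audio and one fed Mel-frequency cepstral coefficients) could be compared by scoring each against the same reference boundaries at the same tolerance.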