APhL Aligner: A Neural Network Forced-Alignment System

  • Abstract
  • Forced alignment is increasingly used in phonetics to automatically produce boundaries between words and phones. These boundaries can have significant errors and are often placed only at multiples of a predetermined time interval, such as every 10 ms. We discuss some potential remedies to these difficulties and test them in a new neural network-based forced alignment system called the APhL Aligner, trained on the TIMIT and Buckeye speech corpora. Some of the error incurred during forced alignment can be attributed to the acoustic models that attempt to separate phones from each other; even state-of-the-art neural network models struggle to acoustically separate phones. We examine the effect of relaxing the requirement to separate phones by instead training a separate detector for each phone class. Resolving the 10 ms interval difficulty requires a different approach. As with most aligners, we perform a Viterbi-style alignment to align windows of audio spaced at 10 ms intervals to the phone string given by a pronunciation dictionary. We then add an additional step, using linear interpolation to place the boundary at an intermediate point between the 10 ms frames. We compare the results of these manipulations to the results of the Montreal Forced Aligner, custom-trained on the same data.

  • Type of Item
    Conference/Workshop Presentation
  • License
    Attribution 4.0 International
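
The sub-frame boundary-placement step described in the abstract can be sketched as follows. This is a minimal illustration, not the APhL Aligner's actual implementation: it assumes the interpolated quantity is the difference between the frame-level scores of the two phones adjacent to a Viterbi boundary, and the function name and signature are hypothetical.

```python
def refine_boundary(p_prev_a, p_prev_b, p_curr_a, p_curr_b,
                    t_prev_ms, hop_ms=10.0):
    """Place a phone boundary between two 10 ms frames by linear interpolation.

    The Viterbi alignment says phone `a` occupies the frame at `t_prev_ms`
    and phone `b` occupies the next frame, `hop_ms` later. Rather than snap
    the boundary to a frame edge, estimate where the score advantage of `a`
    over `b` crosses zero between the two frames.
    """
    d_prev = p_prev_a - p_prev_b  # a still dominant at the earlier frame (> 0)
    d_curr = p_curr_a - p_curr_b  # b dominant at the later frame (< 0)
    denom = d_prev - d_curr
    if denom <= 0:
        # No crossing between the frames; fall back to the frame boundary.
        return t_prev_ms
    frac = d_prev / denom         # fraction of the hop at which the lines cross
    return t_prev_ms + frac * hop_ms
```

For example, if phone `a` scores 0.8 vs. 0.2 at the 100 ms frame and 0.3 vs. 0.7 at the 110 ms frame, the crossing falls 60% of the way through the hop, placing the boundary at 106 ms instead of at either frame edge.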