Since 2017, Google Cloud has offered a Speech-to-Text (STT) API that third parties can take advantage of in their own services. The newest Google speech recognition models are more accurate thanks to a “major” technology improvement, and are particularly suited to creating voice UIs.
The new neural sequence-to-sequence model for Google’s Speech-to-Text API improves accuracy in 23 languages and 61 of the supported locales. In addition to “out-of-box quality improvements,” there’s expanded support for different kinds of voices, noise environments, and acoustic conditions.
For the past several years, automated speech recognition (ASR) techniques have been based on separate acoustic, pronunciation, and language models. Historically, each of these three individual components was trained separately, then assembled afterwards to do speech recognition.
Google describes the change this way: “The conformer models that we’re announcing today are based on a single neural network. As opposed to training three separate models that need to be subsequently brought together, this approach offers more efficient use of model parameters.”
These improvements allow for “more accurate outputs in more contexts,” with Google specifically touting how speech recognition can now be brought to more use cases. In the case of voice control UIs, “users [can] speak to these interfaces more naturally and in longer sentences.”
Google highlights two of the new models:
- “Latest long” is specifically designed for long-form spontaneous speech, similar to the existing “video” model.
- “Latest short,” on the other hand, gives great quality and great latency on short utterances like commands or phrases.
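As a rough illustration of how a developer might choose between the two models, the sketch below builds a JSON request body for a recognition call. The article only gives the display names, so the identifiers `latest_long` and `latest_short`, the `config`/`audio` field names, the helper `build_recognize_request`, and the `gs://` URI are all assumptions made for this example and should be checked against Google's current Speech-to-Text documentation. The code only constructs the payload; it does not contact any service.

```python
import json

# Assumed model identifiers; the article only names "Latest long" and "Latest short".
LATEST_LONG = "latest_long"
LATEST_SHORT = "latest_short"


def build_recognize_request(audio_uri, language_code="en-US", short_utterance=True):
    """Return a JSON-serializable request body (hypothetical helper).

    short_utterance=True picks the model aimed at commands and phrases;
    False picks the one aimed at long-form spontaneous speech.
    """
    model = LATEST_SHORT if short_utterance else LATEST_LONG
    return {
        # Field names are assumed to mirror the REST request shape.
        "config": {"languageCode": language_code, "model": model},
        "audio": {"uri": audio_uri},
    }


if __name__ == "__main__":
    # Placeholder bucket and file name, for illustration only.
    body = build_recognize_request("gs://example-bucket/command.wav")
    print(json.dumps(body, indent=2))
```

A voice-command interface would presumably keep `short_utterance=True`, while a transcription job for meetings or video would pass `short_utterance=False`.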
Spotify was an early adopter of these new models and worked “closely with Google” on the “Hey Spotify” voice interface found in its mobile apps and on Car Thing. In our review, we noted that Car Thing was good at the underlying task of voice recognition and transcription:
“The basics work fine, but having a voice assistant that can’t do anything additional beyond what, say, an always-listening Google Assistant on your phone could do is a bit frustrating. It is nice, though, that Car Thing moves the mics away from your phone for better accuracy. I was never disappointed with Car Thing’s ability to hear my commands.”
Author: Abner Li