Each of us is unique in how we use language – how we speak, the words we choose, and so on. In an earlier blog post, we saw how speech recognition systems are made robust to this variation by training on speech and language data that cover many accents, age groups, and other differences in speaking style. This creates very robust systems that work well for (nearly) every speaker; we call this “speaker-independent” speech recognition.
But in some cases, the individuality of the speaker matters and can be leveraged to create even better experiences – like our latest Dragon Individual offerings, which are typically used by a single user. This lets us go beyond speaker-independent speech recognition by adapting to each user in a speaker-dependent way. Dragon does this on several levels:
- It adapts to the user’s active vocabulary by inspecting texts the user has created in the past, both adding custom words to that vocabulary and learning the typical phrases and text patterns the user employs (see the sketch after this list).
- During each session, it performs a fast adaptation of its acoustic model (which captures how words are pronounced) based on just a few seconds of speech from the user. This also lets it adapt to how the user’s voice sounds in the moment: for instance, whether they have a cold, are using a different microphone, or are speaking in a different environment.
- During the optional enrollment step, or later after a dictation session ends, Dragon performs more intensive learning in an offline mode, continuing to adapt its models to the specific user’s speaking patterns over time.
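To make the first of these levels concrete, here is a minimal sketch in Python of how a system might mine a user’s texts for custom words and typical phrases. It is purely illustrative; the function and parameter names are our own, not Dragon’s actual implementation.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+
import re

def adapt_vocabulary(documents, base_vocab, min_count=3):
    """Illustrative sketch: mine a user's texts for candidate custom
    words and frequent word pairs. Not Dragon's actual algorithm."""
    words = [w.lower() for doc in documents
             for w in re.findall(r"[A-Za-z']+", doc)]
    counts = Counter(words)
    # Words the user writes often but the base vocabulary lacks
    custom_words = {w for w, c in counts.items()
                    if c >= min_count and w not in base_vocab}
    # Frequent adjacent word pairs approximate the user's typical phrasing
    bigrams = Counter(pairwise(words))
    frequent_phrases = {pair for pair, c in bigrams.items() if c >= min_count}
    return custom_words, frequent_phrases
```

A real system would weight such candidates into its language model rather than simply collecting them, but the principle is the same: learn from what the user has already written.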
That last point, offline adaptation, deserves more attention. Dragon uses Deep Neural Networks end-to-end, both at the level of the language model, which captures how frequently words occur and in which combinations they typically appear, and at the level of the acoustic model, which deciphers the smallest spoken units of a language, its phonemes.
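To give a feel for the acoustic-model side, here is a deliberately tiny sketch in PyTorch. The feature and phoneme dimensions are assumptions for illustration; Dragon’s production models are far larger and use more sophisticated architectures.

```python
import torch
import torch.nn as nn

N_FEATS = 40     # assumed: 40-dim acoustic features per 10 ms audio frame
N_PHONEMES = 40  # assumed: roughly the size of an English phoneme inventory

# A toy acoustic model: maps each acoustic frame to a probability
# distribution over phonemes.
acoustic_model = nn.Sequential(
    nn.Linear(N_FEATS, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, N_PHONEMES),
)

frames = torch.randn(100, N_FEATS)  # 100 frames is about 1 second of audio
phoneme_probs = torch.softmax(acoustic_model(frames), dim=-1)  # per-frame distribution
```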
These models are quite large, and before they leave our labs they have already been trained on vast amounts of data. One reason why Neural Networks have taken off only now, rather than in the late 20th century when they were invented, is that training is a very compute-intensive process. We use a significant number of GPUs (Graphics Processing Units) to train our models. GPUs were originally invented for computer graphics applications like video games. Rendering images and training Deep Neural Networks have a lot in common: both tasks apply relatively simple calculations to lots of data points at the same time, and this is exactly what GPUs are good at. We use multiple GPUs in parallel in one training session to speed up the training process.
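In practice, such parallel training is often organised as data parallelism: each GPU holds a replica of the model, processes a different slice of the data, and gradients are averaged across GPUs after every step. The skeleton below shows what that looks like with PyTorch’s DistributedDataParallel; it is a generic sketch of the technique, not our actual training code.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, model, dataset):
    # One process per GPU (launched e.g. with torchrun); each replica
    # sees a different shard of the training data.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])
    loader = DataLoader(dataset, batch_size=64,
                        sampler=DistributedSampler(dataset))
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for features, labels in loader:
        optim.zero_grad()
        loss = loss_fn(model(features.cuda(rank)), labels.cuda(rank))
        loss.backward()  # gradients are all-reduced across the GPUs here
        optim.step()
```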
But how do we apply this outside of our data centres? Adapting the Deep Neural Networks that make up the acoustic model to the speech coming from the user is similar to training them, and we want that to happen on the user’s PC, Mac or laptop – and we want it to be fast. This is a demanding task: we need to make sure adaptation works with just a little data, and that it is a computationally very efficient process.
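Dragon’s exact adaptation recipe is proprietary, but one standard way to meet both constraints, little data and little compute, is to freeze most of the network and fine-tune only a small subset of its parameters. Here is a minimal sketch of that idea, reusing the toy acoustic model from above:

```python
import torch

def adapt_to_user(acoustic_model, user_frames, user_labels, steps=20):
    """Sketch of lightweight speaker adaptation (not necessarily Dragon's
    method): freeze the network, then fine-tune only its last layer on a
    few seconds of the user's speech. Touching few parameters keeps the
    computation cheap and avoids overfitting the tiny sample."""
    for p in acoustic_model.parameters():
        p.requires_grad = False
    last_layer = list(acoustic_model.children())[-1]
    for p in last_layer.parameters():
        p.requires_grad = True
    optim = torch.optim.Adam(last_layer.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optim.zero_grad()
        # user_labels: per-frame phoneme targets from the enrollment text
        loss = loss_fn(acoustic_model(user_frames), user_labels)
        loss.backward()
        optim.step()
```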
Packaging this process so that an individual can run it on their own desktop or laptop is the culmination of many years of innovation in speech recognition and machine learning R&D. Enjoy the result: a highly accurate Dragon experience that is fully personalised to you and your voice.