Why we’re using Deep Learning for our Dragon speech recognition engine

Everybody is special in how we use language – how we speak and the words we use. And in some cases, the individuality of the speaker matters and can be leveraged to create even better experiences through Deep Learning and Neural Networks – like our latest Dragon Individual and Dragon Legal offerings.
By Nils Lenke
Dragon uses deep learning for more accurate speech recognition.

Everybody is special in how we use language – how we speak, the words we use, and so on. In an earlier blog post, we saw how speech recognition systems cope with this variation by training on speech and language data that cover many accents, age groups and other variations in speaking style. This creates very robust systems that work well for (nearly) every speaker; we call this “speaker-independent” speech recognition.

But in some cases, the individuality of the speaker matters and can be leveraged to create even better experiences – as with our latest Dragon Individual offerings, which are typically used by a single user. This allows us to go beyond speaker-independent speech recognition by adapting to each user in a speaker-dependent way. Dragon does this on several levels:

  • It adapts to the user’s active vocabulary by inspecting texts the user has created in the past, both by adding custom words to its active vocabulary and by learning the typical phrases and text patterns the user employs.
  • During each session, it does a fast adaptation of its acoustic model (capturing how words are pronounced) based on just a few seconds of speech from the user. By doing this, it can also adapt to how a user’s voice sounds in the moment; for instance, whether they have a cold, are using a different microphone, or are in a different environment.
  • During the optional enrollment step, or later after a dictation session ends, Dragon does some more intensive learning in an offline mode. Over time, it keeps adapting its models more closely to the specific user’s speaking patterns.
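
The first of these levels, vocabulary adaptation, can be sketched in a few lines of Python. This is only an illustration of the general idea and not Dragon's actual implementation: the function names (`tokenize`, `adapt_vocabulary`), the `min_count` threshold and the sample texts are all assumptions made for this example. It collects words the base vocabulary does not know yet, and counts adjacent word pairs as a crude stand-in for "typical phrases".

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def adapt_vocabulary(base_vocab, user_texts, min_count=2):
    """Learn custom words and word-pair counts from a user's past texts.

    Hypothetical helper for illustration only.
    Returns (custom_words, bigram_counts):
      custom_words  - words not in base_vocab seen at least min_count times
      bigram_counts - Counter of adjacent word pairs across all texts
    """
    word_counts = Counter()
    bigram_counts = Counter()
    for text in user_texts:
        tokens = tokenize(text)
        word_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    custom_words = {
        word for word, count in word_counts.items()
        if count >= min_count and word not in base_vocab
    }
    return custom_words, bigram_counts


# Made-up example data: a tiny base vocabulary and two user documents.
base_vocab = {"the", "patient", "was", "given", "a", "dose", "of", "today"}
user_texts = [
    "The patient was given a dose of amoxicillin today.",
    "A dose of amoxicillin was given.",
]

custom_words, bigrams = adapt_vocabulary(base_vocab, user_texts)
print(custom_words)              # words to add to the active vocabulary
print(bigrams[("dose", "of")])   # how often this word pair was seen
```

A production system would feed such counts into a far richer language model; the sketch only shows why a user's own documents are a useful source of both new words and typical word combinations.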

This latter point deserves more attention. Dragon uses Deep Neural Networks end-to-end, both for the language model (capturing the frequency of words and the combinations in which they typically occur) and for the acoustic model (deciphering the smallest spoken units of a language, or phonemes).

These models are quite large, and before they leave our labs they have already been trained on lots and lots of data. One of the reasons why Neural Networks have taken off only now, and not in the 20th century when they were invented, is that training is a very compute-intensive process. We use a significant number of GPUs (Graphics Processing Units) to train our models. GPUs were originally invented for computer graphics applications like video games. Computing images and training Deep Neural Networks have a lot in common: both tasks require applying relatively simple calculations to lots of data points at the same time, and this is what GPUs are good at. We use multiple GPUs in parallel in one training session to speed up the training process.
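
To make "relatively simple calculations on lots of data points" concrete, here is a minimal sketch in plain Python, with made-up weights and inputs. It applies a single artificial neuron (a few multiply-adds followed by a simple threshold) to every point in a batch. The names `neuron` and `layer_on_batch` are illustrative assumptions. The code runs sequentially, but no point's result depends on any other point, and that independence is what lets a GPU process many of them at once.

```python
def relu(x):
    """Simple non-linearity: pass positive values, clamp negatives to zero."""
    return x if x > 0.0 else 0.0


def neuron(weights, bias, point):
    """One multiply-add per input value, followed by the non-linearity."""
    total = bias
    for w, x in zip(weights, point):
        total += w * x
    return relu(total)


def layer_on_batch(weights, bias, batch):
    """Apply the same neuron to every data point in the batch.

    Each point is handled independently of the others. Here that happens
    one after another; on a GPU these calculations would run side by side.
    """
    return [neuron(weights, bias, point) for point in batch]


# Made-up numbers for illustration.
weights = [0.5, -1.0, 2.0]
bias = 0.1
batch = [
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
]

outputs = layer_on_batch(weights, bias, batch)
print(outputs)
```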

But how do we apply this outside of our data centres? Adapting the Deep Neural Networks that make up the acoustic model to the speech coming from the user is similar to training them, and we want to make that happen on the user’s PC, Mac or laptop – and we want it to be fast. It is a demanding task, as we need to make sure that adaptation works with just a little data and that it is computationally very efficient.
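
One common way to make adaptation cheap is to keep most of a trained model frozen and adjust only a handful of parameters. The sketch below illustrates that general idea with a toy linear model whose weights stay fixed while a single bias value is fitted to three samples by gradient descent. It is an assumption-laden illustration, not a description of how Dragon adapts its acoustic model; the function names, learning rate, step count and data are all invented for the example.

```python
def predict(weights, bias, features):
    """Toy 'model': a weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, samples):
    """Average squared difference between predictions and targets."""
    errors = [(predict(weights, bias, x) - y) ** 2 for x, y in samples]
    return sum(errors) / len(errors)


def adapt_bias(weights, bias, samples, learning_rate=0.1, steps=20):
    """Fit only the bias to a few samples; the weights stay frozen.

    Hypothetical helper: plain gradient descent on the mean squared error.
    """
    for _ in range(steps):
        gradient = sum(
            2.0 * (predict(weights, bias, x) - y) for x, y in samples
        ) / len(samples)
        bias -= learning_rate * gradient
    return bias


# Frozen weights from "lab training" (made-up values).
weights = [1.0, -2.0]
base_bias = 0.0

# Three samples from one "user", each shifted by +0.3 relative to the
# base model's output (invented data for illustration).
samples = [
    ([1.0, 0.5], 0.3),
    ([0.0, 1.0], -1.7),
    ([2.0, 1.0], 0.3),
]

adapted_bias = adapt_bias(weights, base_bias, samples)
print(mean_squared_error(weights, base_bias, samples))     # error before
print(mean_squared_error(weights, adapted_bias, samples))  # error after
```

Because only one number is being learned, three samples and twenty small updates are enough to bring it close to the user-specific shift. Adapting every weight of a large network would need far more data and computation.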

Packaging this process in a way that allows the individual to run it on their desktop or laptop is the culmination of many years of innovation in speech recognition and machine learning R&D. Enjoy the result: a highly accurate Dragon experience that is fully personalised to you and your voice.



About Nils Lenke

Nils Lenke is Senior Director, Corporate Research at Nuance Communications and oversees the coordination of various research initiatives within Nuance’s 300-strong global corporate research organisation, which is responsible for developing a broad range of cognitive computing technologies and applying these to solutions for the mobile, automotive, healthcare, and enterprise markets. The core technologies within the corporate research team’s remit cover deep learning, speech recognition, speech synthesis, natural language understanding and generation, dialogue, planning, reasoning, and knowledge representation. The applications of these artificial intelligence (AI) technologies include collaborative virtual assistants that enable more human-like interactions to enhance automation and productivity, as well as systems which extract knowledge and make predictions from data streams. Nils organises Nuance’s internal research conferences, coordinates Nuance’s ties to academia and is a board member of the DFKI (German Research Center for Artificial Intelligence), the world’s largest AI centre, where Nuance is a shareholder. Nils joined Nuance (formerly ScanSoft) in 2003, after holding various roles at Philips Speech Processing for nearly a decade. He holds an M.A. from the University of Bonn, where he wrote his thesis on “the Communication model in AI Research” in 1989, a Diploma in Computer Science from the University of Koblenz, a Ph.D. in Computational Linguistics from the University of Duisburg based on his AI-centric dissertation on Natural Language Generation of Paraphrases (1995), and finally an M.Sc. in Environmental Sciences from the University of Hagen. Nils has been awarded eight patents for inventions ranging from a “speech recognition system for numerical characters” to “passive acquisition of knowledge in speech recognition and natural language understanding”. Nils speaks six languages, including his mother tongue, German, and a little Russian and Mandarin.
In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.