How the technology transcribing your meetings actually works

Which do you think is easier: developing speech recognition for a jet pilot to use in flight, or building it to transcribe a meeting between colleagues? The answer may surprise you.

Back in the 1990s, when you told someone that your company was working on speech recognition for jet pilots, they would inevitably say, “Wow, that must be difficult, because of all the noise.” And you would say, “Well, yes and no.” Yes, there is noise, but it is very predictable noise (caused by the engine and the wind). This “stationary” noise can be filtered out quite reliably. Plus, the microphone is always in the same place and positioned very close to the pilot (e.g. fixed in their oxygen mask). So it actually was simpler than it sounded.
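Filtering out stationary noise of this kind can be sketched with classic spectral subtraction: estimate the noise’s average spectrum from a noise-only stretch of audio, subtract it from the spectrum of each speech frame, and resynthesise. This is a minimal numpy illustration of the idea only (windowing, overlap-add, and smoothing, which any real system needs, are omitted):

```python
import numpy as np

def spectral_subtract(frames, noise_frames):
    """Suppress stationary noise (e.g. engine hum) by subtracting an
    average noise spectrum estimated from noise-only frames.

    frames, noise_frames: 2-D arrays, one audio frame per row.
    """
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    phases = np.angle(np.fft.rfft(frames, axis=1))
    # Stationary noise barely changes over time, so its average
    # magnitude spectrum is a good estimate of it in every frame.
    noise_spectrum = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    # Subtract, flooring at zero to avoid negative magnitudes.
    cleaned = np.maximum(spectra - noise_spectrum, 0.0)
    # Resynthesise using the original phase of each frame.
    return np.fft.irfft(cleaned * np.exp(1j * phases),
                        n=frames.shape[1], axis=1)
```

The same predictability that makes cockpit noise filterable is exactly what a lively conference room lacks.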

But the reverse can also be true: something seemingly simple turns out to be harder in practice. People who want speech recognition to automatically transcribe what is said during a meeting (because nobody wants to be the scribe!) assume it’s an easy task. It really can’t be that hard to capture a meeting, right? There’s more to it than you think.

A number of variables come into play. First, we have to identify who is talking and where they are located. Conference rooms often feature more than one microphone, and the potential speakers may be scattered around them. This includes scenarios where speakers are quite distant from the closest microphone, which also makes reverberation (a kind of echo from the room walls) a problem. So, initially, we will not know who is speaking or where they are located with respect to the microphone. To account for this, the system will focus its attention on the active speaker only, working to filter out any background noises and the echo effects mentioned above. As humans, we do this all the time, without thinking much about it; you may have heard it referred to as the cocktail party effect. In an environment where more than one microphone is available, we can mimic this capability by applying beamforming technology, which we also use in car and home environments.
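The simplest form of beamforming is delay-and-sum: delay each microphone’s channel so that sound arriving from the speaker’s direction lines up across all microphones, then average. Speech from that direction adds up coherently, while noise arriving from elsewhere partially cancels. A toy numpy sketch, assuming the integer sample delays are already known (estimating them from the speaker’s position is a separate problem):

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Steer a microphone array toward a speaker.

    mic_signals: list of equal-length 1-D arrays, one per microphone.
    delays: per-microphone arrival delay of the target speaker's
            sound, in samples (assumed known in this sketch).
    """
    # Undo each channel's delay so the speaker's wavefront aligns
    # (np.roll is circular, which is fine for a toy example).
    aligned = [np.roll(sig, -d) for sig, d in zip(mic_signals, delays)]
    # Averaging reinforces the aligned speech and attenuates
    # uncorrelated noise from other directions.
    return np.mean(aligned, axis=0)
```

As the speaker changes, the delays change too, which is why the beam has to be re-steered continually.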

Related is the task of distinguishing between multiple speakers, because they will alternate over time (which means you need to continually adapt your beamforming). Speaker diarisation – or, sorting speech into per-speaker buckets – is how we do this. One helpful trick is to make use of voice biometric technology. While its main use case is to authenticate a speaker, you can also use it to identify a known speaker in a group. Once you have succeeded with diarisation, you can also use the speech of each individual speaker to adapt the speech recognition models to better reflect their characteristics, similar to how we do it for our Dragon dictation software.
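At its core, that bucketing is a clustering problem: compute an embedding vector per speech segment, then group segments whose embeddings are close. In the hypothetical sketch below, plain 2-D feature vectors stand in for the embeddings a neural speaker model would produce, and a small k-means loop with farthest-point initialisation does the grouping:

```python
import numpy as np

def diarise(embeddings, n_speakers=2, n_iter=20):
    """Assign each speech segment a speaker label by clustering its
    embedding vector (k-means, farthest-point initialisation)."""
    X = np.asarray(embeddings, dtype=float)
    # Farthest-point init: start from segment 0, then repeatedly pick
    # the segment farthest from all centres chosen so far.
    centres = [X[0]]
    for _ in range(n_speakers - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centres],
                       axis=0)
        centres.append(X[np.argmax(dists)])
    centres = np.array(centres)
    for _ in range(n_iter):
        # Each segment goes to the nearest speaker centroid...
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centres[None], axis=2), axis=1)
        # ...and each centroid moves to the mean of its segments.
        for k in range(n_speakers):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return labels
```

Voice biometrics slot in naturally here: if some embeddings match enrolled voiceprints, their clusters can be given real names rather than anonymous labels.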

Of course, there may even be times when multiple participants speak at the same time. True, humans typically employ an elaborate ‘turn-taking’ system to predict when it is a good time to take over the role of speaker, but as we all know, that doesn’t always work – more often than not, multiple people will speak at the same time during a meeting. This cross-talk is the next challenge we face, and again, exploiting multiple microphones will help.

Now that we know who is speaking and when (and how), we can start on the actual task: applying speech recognition. This brings another variable into play. Often, we will have no prior knowledge of the meeting topic, so our vocabulary will be very large and it will be difficult to predict what will come next based on context. Recent progress in language modeling seeks to do exactly that – predict words based on context – by using deep neural networks.
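The core idea – predict the next word from context – can be illustrated with the simplest possible language model, a count-based bigram predictor. The deep-neural-network models mentioned above learn the same conditional distribution, P(next word | context), but over far longer contexts and with shared word representations. A toy sketch:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count how often each word follows each other word.
    A toy stand-in for a neural language model."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Most likely next word given the previous word,
    or None if the context was never seen in training."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]
```

With only one word of context, such a model is easily confused; the gain from neural models is precisely that they condition on much more of the preceding conversation.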

With these tools in hand, my colleagues working on capturing and transcribing so-called “ambient speech” have recently reported that they are now beating published results on publicly available test sets by a clear margin. And beyond the lab, we have actually released the Nuance Transcription Engine (NTE). NTE is primarily targeted at a related use case, transcribing the conversations between call center agents and customers for actionable insights, but it can also be used in a wide range of environments for capturing multi-speaker conversations.

Even though it’s not as straightforward a task as you may have thought, by combining several different technologies in the right way, we are able to transcribe meetings successfully. The office of the future may have just found its new scribe.



About Nils Lenke

Nils Lenke is Senior Director, Corporate Research at Nuance Communications and oversees the coordination of various research initiatives within Nuance’s 300-strong global corporate research organisation, which is responsible for developing a broad range of cognitive computing technologies and applying these to solutions for the mobile, automotive, healthcare, and enterprise markets. The core technologies within the corporate research team’s remit cover deep learning, speech recognition, speech synthesis, natural language understanding and generation, dialogue, planning, reasoning, and knowledge representation. The applications of these artificial intelligence (AI) technologies include collaborative virtual assistants that enable more human-like interactions to enhance automation and productivity, as well as systems which extract knowledge and make predictions from data streams. Nils organises Nuance’s internal research conferences, coordinates Nuance’s ties to academia, and is a board member of the DFKI (German Research Institute for Artificial Intelligence), the world’s largest AI centre, where Nuance is a shareholder. Nils joined Nuance (formerly ScanSoft) in 2003, after holding various roles at Philips Speech Processing for nearly a decade. He holds an M.A. from the University of Bonn after writing his thesis on “the Communication model in AI Research” in 1989, a Diploma in Computer Science from the University of Koblenz, a Ph.D. in Computational Linguistics from the University of Duisburg based on his AI-centric dissertation on Natural Language Generation of Paraphrases (1995), and finally an M.Sc. in Environmental Sciences from the University of Hagen. Nils has been awarded 8 patents for inventions ranging from a “speech recognition system for numerical characters” to “passive acquisition of knowledge in speech recognition and natural language understanding”. Nils speaks six languages, including his mother tongue German, and a little Russian and Mandarin.
In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.