Multimodal interaction: How machines learn to understand pointing

Pointing at subjects and objects – whether with words, with gestures, or with the eyes alone – is a very human ability. Smart, multimodal assistants, such as the one in your car, now account for these forms of pointing, making interaction more human-like than ever before. Made possible by image recognition and Deep Learning technologies, this will have significant implications for the autonomous vehicles of the future.

As we learn more about the biological world around us, the list of things only humans can do has dwindled – and that’s before computers started to play chess and Go. Counting? Birds can deal with numbers up to twelve. Using tools? Dolphins in Shark Bay, Australia, are using sponges as a tool for hunting.

Against this background, it may come as a surprise how specifically human pointing is: Although it seems very natural and easy to us, not even chimpanzees, our closest living relatives, can muster more than the most trivial forms of pointing. So how could we expect machines to understand it?

Three forms of pointing

    In 1934, the linguist and psychologist Karl Bühler distinguished three forms of pointing, all connected to language:

    The first is pointing “ad oculos,” that is in the field of visibility centered around the speaker (“here”) and also accessible to the listener. While we can point within this field with our fingers alone, languages offer a special set of pointing words to complement this (“here” vs. “there;” “this” vs. “that;” “left” and “right;” “before” and “behind” etc.).

    The second form of pointing operates in a remembered or imagined world, brought about by language (“When you leave the Metropolitan Museum, Central Park is behind you and the Guggenheim Museum is to your left. We will meet in front of that”).

    The third form is pointing within language: As speech is embedded in time, we often need to point back to something we said a little earlier or point forward to something we will say later. In a past blog post, I described how the anaphoric use of pointing words (“How is the weather in Tokyo?” “Nice and sunny.” “Are there any good hotels there?”) can be supported in smart assistants (and how this capability distinguishes the smarter assistants from the not-so-smart). And the first mode of pointing, at elements in the visible vicinity, is now also available in today’s smart assistants.
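As a rough illustration of the anaphoric case, here is a minimal Python sketch of a dialogue context that remembers the last place mentioned and substitutes it for a sentence-final “there.” Everything in it is an assumption made for illustration (the `DialogueContext` class, the tiny `KNOWN_PLACES` gazetteer, and the rule that only a final “there” is anaphoric); it is not how any particular assistant works.

```python
import re

# Hypothetical gazetteer; a real system would use proper entity recognition.
KNOWN_PLACES = {"tokyo", "berlin", "boston"}


class DialogueContext:
    """Remembers the most recently mentioned place across turns."""

    def __init__(self):
        self.last_place = None

    def interpret(self, utterance):
        tokens = re.findall(r"[A-Za-z]+", utterance)

        # Update the context with any place named in this turn.
        for token in tokens:
            if token.lower() in KNOWN_PLACES:
                self.last_place = token

        # Simplifying assumption: only a sentence-final "there" points back.
        # The "there" in "Are there any ..." is existential, not a pointer.
        if tokens and tokens[-1].lower() == "there" and self.last_place:
            return re.sub(r"\bthere\b(?=\W*$)", f"in {self.last_place}", utterance)
        return utterance


context = DialogueContext()
context.interpret("How is the weather in Tokyo?")
resolved = context.interpret("Are there any good hotels there?")
```

After the second turn, `resolved` holds the question with the pointing word replaced by the remembered place, which is what a downstream hotel search would need.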

    First automotive assistants to support “pointing”

      At CES in Las Vegas this month, we demonstrated how drivers can point to buildings outside the car and ask questions like, “What are the opening hours of that shop?” But the “pointing” doesn’t need to be done with a finger. With the new technology, you can simply look at the object in question, something made possible by eye gaze detection based on a camera that tracks the eyes. This technology imitates human behavior, as humans are very good at guessing where somebody is looking just by observing his or her eyes.

      Biologists suggest that the distinct shape and appearance of the human eye (a dark iris and a contrasting white surrounding) is no accident, but a product of evolution that makes gaze detection easier. Artists have exploited this for many centuries: with just a few brush strokes, they can make figures in their paintings look at other figures or even outside the picture – including at the viewer of the painting. Have a look at Raphael’s Sistine Madonna, which is displayed in Dresden, and see how the figures’ viewing directions make them point at each other and how that guides our view.

      [Image: Raphael, Sistine Madonna, 1513-14. Oil on canvas, 265 x 196 cm. Gemäldegalerie Alte Meister, Dresden. Source: Google Art Project, public domain.]


      Multimodal interaction: When speech, gesture, and handwriting work hand in hand

        Machines can also do this, based on image recognition and Deep Learning. These capabilities, which come out of our cooperation with DFKI, will bring us into the age of truly multimodal assistants. It is important to remember that “multimodal” does not just mean you have a choice between modalities (typing OR speaking OR handwriting on a pad to enter the destination into your navigation system), but that multiple modalities work together to accomplish one task. For example, when someone points to something in the vicinity (modality 1) and says, “tell me more about this” (modality 2), both modalities are needed to work out what that person wants to accomplish.
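To make the idea of two modalities working together concrete, here is a minimal Python sketch. It assumes the gaze tracker already delivers a bearing in degrees and that nearby landmarks are known in a local coordinate frame around the vehicle. The names `Landmark`, `resolve_pointing` and `DEICTIC_WORDS`, as well as the 15-degree tolerance, are hypothetical and chosen for illustration only; this is not a description of how any production system works.

```python
import math
from dataclasses import dataclass

# Pointing words that need a second modality to be interpreted.
DEICTIC_WORDS = {"this", "that", "these", "those", "here", "there"}


@dataclass
class Landmark:
    name: str
    x: float  # metres to the right of the vehicle (assumed local frame)
    y: float  # metres ahead of the vehicle


def bearing_deg(x, y):
    """Bearing from the vehicle to a point: 0 = straight ahead, clockwise."""
    return math.degrees(math.atan2(x, y)) % 360.0


def angular_gap(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def resolve_pointing(utterance, gaze_bearing, landmarks, tolerance_deg=15.0):
    """Combine speech (modality 2) with gaze (modality 1).

    Returns the landmark the speaker most plausibly means, or None when
    the utterance contains no pointing word or nothing lies near the gaze.
    """
    words = {w.strip("?.,!\"'").lower() for w in utterance.split()}
    if not words & DEICTIC_WORDS:
        return None  # nothing to resolve: speech alone carries the meaning
    if gaze_bearing is None or not landmarks:
        return None  # pointing word present, but no gaze data to ground it

    def gap(landmark):
        return angular_gap(gaze_bearing, bearing_deg(landmark.x, landmark.y))

    best = min(landmarks, key=gap)
    return best if gap(best) <= tolerance_deg else None


landmarks = [Landmark("bakery", 20.0, 20.0), Landmark("museum", -30.0, 10.0)]
target = resolve_pointing("What are the opening hours of that shop?", 50.0, landmarks)
```

Neither input is enough on its own in this sketch: without the word “that,” nothing is resolved, and without a gaze bearing, “that” has no referent.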

        Multimodal interaction – a key feature for Level 4 and 5 autonomous vehicles?

          While it is obvious why such a capability is attractive to today’s drivers, there are hints that it might become even more important as we enter the age of autonomous vehicles. Many people are wondering what drivers will do when they don’t have to drive any more, something they would experience in Levels 4 and 5 of the autonomous driving scale. Some studies indicate that perhaps the answer is not that much, actually. For example, a 2016 German study asked people about the specific advantages they perceived in such vehicles, and “… that I can enjoy the landscape” came out as the top choice at all levels of autonomy.

          It’s not too difficult to imagine a future with gaze and gesture detection, combined with a “just talk” mode of speech recognition – one where you can ask “what is that building?” without having to press a button or say a keyword first. This future will give users of autonomous vehicles exactly what they want. And for today’s users of truly multimodal systems, machines just got a little more human-like again.


          Human-like interaction

          Learn more about how Dragon Drive, Nuance’s hybrid automotive assistant, now tightly integrates conversational artificial intelligence (AI) with non-verbal modalities such as gaze detection.


          About Nils Lenke

          Nils Lenke is Senior Director, Corporate Research at Nuance Communications and oversees the coordination of various research initiatives within Nuance’s 300-strong global corporate research organisation, which is responsible for developing a broad range of cognitive computing technologies and applying these to solutions for the mobile, automotive, healthcare, and enterprise markets. The core technologies within the corporate research team’s remit cover deep learning, speech recognition, speech synthesis, natural language understanding and generation, dialogue, planning, reasoning, and knowledge representation. The applications of these artificial intelligence (AI) technologies include collaborative virtual assistants that enable more human-like interactions to enhance automation and productivity, as well as systems which extract knowledge and make predictions from data streams. Nils organises Nuance’s internal research conferences, coordinates Nuance’s ties to academia and is a board member of the DFKI (German Research Institute for Artificial Intelligence), the world’s largest AI centre, where Nuance is a shareholder. Nils joined Nuance (formerly ScanSoft) in 2003, after holding various roles at Philips Speech Processing for nearly a decade. He holds an M.A. from the University of Bonn after writing his thesis on “the Communication model in AI Research” in 1989, a Diploma in Computer Science from the University of Koblenz, a Ph.D. in Computational Linguistics from the University of Duisburg based on his AI-centric dissertation on Natural Language Generation of Paraphrases (1995), and finally an M.Sc. in Environmental Sciences from the University of Hagen. Nils has been awarded 8 patents for inventions ranging from a “speech recognition system for numerical characters” to “passive acquisition of knowledge in speech recognition and natural language understanding”.
Nils speaks six languages, including his mother tongue, German, and a little Russian and Mandarin. In his spare time, Nils enjoys hiking and hunting in archives for documents that shed some light on the history of science in the early modern period.