Bolt, Richard A. (1980)
“Put-That-There”: Voice and Gesture at the Graphics Interface
In International Conference on Computer Graphics and Interactive Techniques, pp. 262–270
Review by: Dreier, Matthias (2005-10-26)
In 1980, when most computers did not have a graphical user interface, researchers at the Massachusetts Institute of Technology (MIT) experimented with voice-input and gesture-recognition as a means of operating a software application.
Richard Bolt describes the “Media Room” at MIT in which the experiment took place. The room is equipped with a large screen, two TV monitors, loudspeakers, and a chair with two joysticks. All information is represented spatially, i.e. the wall-sized screen displays a map, for instance, and coloured shapes represent objects in this micro-world. The prototype application is operated by the combined use of voice-input and gesture-recognition.
At that time, voice-recognition was a very difficult task. The computer programme used in the experiment was capable of recognising a sequence of at most five words out of a vocabulary of at most 120 words. With this limited capacity only a very simple application could be built. In the Media Room application the user was able to create simple geometric shapes, give them names and attributes like colour and size, move them around on a map, and delete them. The true innovation of the Media Room was the combination of voice-input with gesture-recognition. Being able to point at an object and speak a command is a very natural way of interacting with a system.
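The pointing-plus-speaking interaction can be illustrated with a minimal sketch. It is not Bolt's implementation: the names Shape, nearest_shape and put_that_there are hypothetical, and the sketch assumes that each recognised word arrives together with the screen position the user was pointing at when the word was spoken. "that" is resolved to the shape nearest the pointer, and "there" to the pointer position itself.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Shape:
    # Hypothetical stand-in for a coloured shape on the wall-sized display.
    name: str
    x: float
    y: float


def nearest_shape(shapes, point):
    """Return the shape closest to the pointed-at screen position."""
    px, py = point
    return min(shapes, key=lambda s: hypot(s.x - px, s.y - py))


def put_that_there(shapes, utterance):
    """Resolve a 'put that there' command.

    `utterance` is a list of (word, pointer_position) pairs; pairing each
    word with a pointer sample is an assumption made for this sketch.
    """
    target = None
    for word, point in utterance:
        if word == "that":
            # The pronoun refers to whatever the user points at.
            target = nearest_shape(shapes, point)
        elif word == "there" and target is not None:
            # The destination is the position pointed at while speaking.
            target.x, target.y = point
    return target


shapes = [Shape("circle", 10.0, 10.0), Shape("square", 80.0, 40.0)]
utterance = [
    ("put", (0.0, 0.0)),
    ("that", (78.0, 43.0)),   # pointing near the square
    ("there", (30.0, 60.0)),  # pointing at the destination
]
moved = put_that_there(shapes, utterance)
```

In this sketch the spoken words carry no coordinates or object names of their own; the gesture supplies both, which is what keeps the required vocabulary so small.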
Today, most computers have a graphical user interface, but voice-input and gesture-recognition are still rare. We are still far away from natural-language interfaces and widespread use of gesture-recognition. The Media Room showed that the combination of these two inputs is an effective way of commanding an application. But is it efficient as well? As long as voice-recognition remains so error-prone, it will not prevail over mouse and keyboard input.
Bolt presents a prototype application with a totally new interface paradigm. The article is sometimes very technical, particularly when Bolt describes the details of the gesture-recogniser: the nutating dipoles, the epoxied coils, and their mutually orthogonal mountings. However, the concept of continuous visual feedback and a system design based upon what is most convenient for the user, not for the programmer, are groundbreaking. The article also touches on aspects of natural language processing, a topic that inevitably emerges from the new interaction paradigm.