Article
09/08/2015

April 2012

Volume 27 Number 04

Kinect - Context-Aware Dialogue with Kinect

By Leland Holmquest | April 2012

Meet Lily, my office assistant. We converse often, and at my direction Lily performs common business tasks such as looking up information and working with Microsoft Office documents. But more important, Lily is a virtual office assistant, a Microsoft Kinect-enabled Windows Presentation Foundation (WPF) application that’s part of a project to advance the means of context-aware dialogue and multimodal communication.

Before I get into the nuts-and-bolts code of my app—which I developed as part of my graduate work at George Mason University—I’ll explain what I mean by context-aware dialogue and multimodal communication.

Context-Aware Dialogue and Multimodal Communication

As human beings, we have rich and complex means of communicating. Consider the following scenario: A baby begins crying. When the infant notices his mother is looking, he points at a cookie lying on the floor. The mother smiles in that sympathetic way mothers have, bends over, picks up the cookie and returns it to the baby. Delighted at the return of the treasure, the baby squeals and gives a quick clap of its hands before greedily grabbing the cookie.

This scene describes a simple sequence of events. But take a closer look. Examine the modes of communication that took place. Consider implementing a software system where either the baby or the mother is removed and the communication is facilitated by the system. You quickly realize just how complex the communication methods employed by the two actors really are. There’s audio processing in understanding the baby’s cry, the squeal of joy and the sound of the clapping hands. There’s the visual analysis required to comprehend the gesture of the baby pointing at the cookie, as well as to infer the mother’s mild reproach from her sympathetic smile. As is often the case with actions as ubiquitous as these, we take for granted the level of sophistication employed until we have to implement that same level of experience through a machine.

Let’s add a little complexity to the methods of communication. Consider the following scenario. You walk into a room where several people are in the middle of a conversation. You hear a single word: “cool.” The others in the room look to you to contribute. What could you offer? Cool can mean a great many things. For example, the person might have been discussing the temperature of the room. The speaker might have been exhibiting approval of something (“that car is cool”). The person could have been discussing the relations between countries (“negotiations are beginning to cool”). Without the benefit of the context surrounding that single word, one stands little chance of understanding the meaning of the word at the point that it’s uttered. There has to be some level of semantic understanding in order to comprehend the intended meaning. This concept is at the core of this article.

Project Lily

I created Project Lily as the final project for CS895: Software for Context-Aware Multiuser Systems at George Mason University, taught by Dr. João Pedro Sousa. As mentioned, Lily is a virtual assistant placed in a typical business office setting. I used the Kinect device and the Kinect for Windows SDK beta 2. Kinect provides a color camera, a depth-sensing camera, an array of four microphones and a convenient API that can be used to create natural UIs. Also, the Microsoft Kinect for Windows site (microsoft.com/en-us/kinectforwindows) and Channel 9 (bit.ly/zD15UR) provide a plethora of useful, related examples. Kinect has brought incredible capabilities to developers in a (relatively) inexpensive package. This is demonstrated by Kinect breaking the Guinness World Records “fastest selling consumer device” record (on.mash.to/hVbZOA). The Kinect technical specifications (documented at bit.ly/zZ1PN7) include:

Color VGA motion camera: 640x480 pixel resolution at 30 frames per second (fps)
Depth camera: 640x480 pixel resolution at 30 fps
Array of four microphones
Field of view
- Horizontal field of view: 57 degrees
- Vertical field of view: 43 degrees
- Physical tilt range: ± 27 degrees
- Depth sensor range: 1.2m - 3.5m
Skeletal tracking system
- Ability to track up to six people, including two active players
- Ability to track 20 joints per active player
An echo-cancellation system that enhances voice input
Speech recognition

Kinect Architecture

Microsoft also provides an understanding of the architecture that Kinect is built upon, shown in Figure 1.

Figure 1 Kinect for Windows Architecture

The circled numbers in Figure 1 correspond to the following:

1. Kinect hardware: The hardware components, including the Kinect sensor and the USB hub, through which the sensor is connected to the computer.
2. Microsoft Kinect drivers: The Windows 7 drivers for the Kinect sensor are installed as part of the beta SDK setup process as described in this document. The Microsoft Kinect drivers support:
- The Kinect sensor’s microphone array as a kernel-mode audio device that you can access through the standard audio APIs in Windows.
- Streaming image and depth data.
- Device enumeration functions that enable an application to use more than one Kinect sensor that’s connected to the computer.
3. NUI API: A set of APIs that retrieves data from the image sensors and controls the Kinect devices.
4. KinectAudio DMO: The Kinect DMO that extends the microphone array support in Windows 7 to expose beamforming and source localization functionality.
5. Windows 7 standard APIs: The audio, speech and media APIs in Windows 7, as described in the Windows 7 SDK and the Microsoft Speech SDK (Kinect for Windows SDK beta Programming Guide).

In this article, I’ll demonstrate how I used the microphone array and the speech recognition engine (SRE) to create vocabularies that are context-specific. In other words, the vocabularies for which Kinect is listening will be dependent on the context the user is creating. I’ll show a framework in which the application will monitor what the user is doing, swapping grammars in and out of the SRE (depending on the context), which gives the user a natural and intuitive means of interacting with Lily without having to memorize specific commands and patterns of use.

The SpeechTracker Class

For Project Lily, I used a separate class called SpeechTracker to handle all voice processing. The SpeechTracker was designed to run on a thread independent of the UI, making it much more responsive, which is a critical aspect to this application. What good is an assistant who’s never listening?

Prior to getting into the heart of the SpeechTracker, a few things require some forethought. The first is determining the contexts for which the application needs to listen. For example, Lily will have a “Research” context that handles actions related to looking up data and information; an “Office” context that handles actions such as opening Word documents and PowerPoint presentations as well as other tasks relevant to an office setting; an “Operations” context that allows the user to make adjustments to the Kinect device; and a “General” context that handles all of the little things that can come up independent of any real context (for example, “What time is it?” “Shut down” or feedback to the system). There are a few other contexts I created in my example, but one other that’s important is the “Context” context. This context is used to communicate what frame of reference Lily should be in. In other words, the Context enum is used to determine the context for which the system should be listening. More on this later.

Each context is then modeled through the use of enums, as illustrated in Figure 2.

Figure 2 Example of Context
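The figure itself isn’t reproduced here, so the following is only a minimal sketch of how such enums might look. The enum type names are the ones used elsewhere in this article; every member name (None, Pulse, CurrentTime and so on) is an assumption made for illustration, not the project’s actual definition.

```csharp
using System;

// The frame of reference the assistant is currently listening in.
// Member names are illustrative assumptions.
public enum Context { None, General, Research, Office, Operations }

// Actions available inside individual contexts (hypothetical members).
// Each enum starts with None so that an unset field has a safe default.
public enum General { None, Pulse, CurrentTime, ShutDown }
public enum Research { None, Begin, LookUp }
public enum Office { None, OpenWord, OpenPowerPoint }
public enum Operations { None, TiltUp, TiltDown }
```

Starting each enum with a None member means a field that’s never assigned falls back to a value that clearly signals “no action.”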

With the contexts modeled, the next thing needed is a means to convey what intention a user has within a specified context. In order to achieve this, I used a struct that simply contains a holder for each context and a bool:

struct Intention
{
  public Context context;
  public General general;
  public Research research;
  public Office office;
  public Operations kinnyops;
  public Meeting meeting;
  public Translate translate;
  public Entertain entertain;
  public bool contextOnly;
}

The bool is of significance. If the Intention is simply to execute a changing of context, then contextOnly is true—otherwise it’s false. The utility of this will be clearer later on; for now, know that it’s a necessary flag.

What’s the User’s Intention?

Now that Project Lily has the ability to communicate an Intention, it needs to know when and what Intention to use at run time. In order to do this, I created a Dictionary (System.Collections.Generic.Dictionary<TKey, TValue>) in which the key is a spoken word or phrase and the value is the associated Intention. As of C# 3.0 (which shipped with the Microsoft .NET Framework 3.5), object and collection initializers give us a succinct way to create objects and initialize their properties, as shown in Figure 3.

Figure 3 The ContextPhrases Dictionary

This particular Dictionary defines the Context phrases. The keys are the words and phrases the end user speaks that are then associated with an Intention. Each Intention is declared and its properties set in a single line. Notice that a single Intention (for example, Research) can be mapped to multiple single words and phrases. This provides the ability to create rich vocabularies that can model the domain-specific language and lexicon of the subject at hand. (Notice one important thing about the ContextPhrases: The property contextOnly is set to true. This tells the system that this action is used only to change out what context is active. This enables the system to short-cycle the logic behind handling spoken events.)
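As a rough illustration of the pattern just described, here’s a minimal sketch of a context-phrase dictionary built with object and collection initializers. The Intention struct is trimmed to two fields, and apart from “I need some information” the phrases, the Phrases class name and the case-insensitive comparer are assumptions for this sketch rather than the project’s actual code.

```csharp
using System;
using System.Collections.Generic;

public enum Context { None, General, Research, Office }

// Trimmed-down version of the article's Intention struct.
public struct Intention
{
    public Context context;
    public bool contextOnly;
}

public static class Phrases
{
    // Keys are spoken phrases; values are the Intentions they trigger.
    // Case-insensitive lookup is a choice made for this sketch.
    public static readonly Dictionary<string, Intention> ContextPhrases =
        new Dictionary<string, Intention>(StringComparer.OrdinalIgnoreCase)
        {
            // Several phrases can map to the same context.
            { "I need some information", new Intention { context = Context.Research, contextOnly = true } },
            { "Research",                new Intention { context = Context.Research, contextOnly = true } },
            { "Open a document",         new Intention { context = Context.Office,   contextOnly = true } },
        };
}
```

A lookup such as Phrases.ContextPhrases["research"] then yields an Intention whose context is Research and whose contextOnly flag is true.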

For a better example of this, look at the snippet from the GeneralPhrases dictionary shown in Figure 4.

Figure 4 The GeneralPhrases Dictionary

Notice that in several cases the same Intention is modeled with different phrases, giving the system the capability to handle dialogue with the user in a rich and humanistic way. One limitation you need to be aware of is that any dictionary that’s going to be used by the SRE can contain no more than 300 entries. Therefore, it’s wise to model one’s vocabularies and grammars prudently. Even if this were not an imposed limitation, it should be considered a best practice to keep the dictionaries as light as possible for best performance.
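To make the many-to-one mapping and the size limit concrete, here’s a small sketch. The three time-related phrases come from this article; the Vocabulary class, its helper methods and the General member names are hypothetical, and the 300-entry figure is simply the limit quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Context { None, General }
public enum General { None, Pulse, CurrentTime, ShutDown }

public struct Intention
{
    public Context context;
    public General general;
    public bool contextOnly;
}

public static class Vocabulary
{
    // Entry limit quoted in the article for a dictionary used by the SRE.
    public const int MaxEntries = 300;

    public static readonly Dictionary<string, Intention> GeneralPhrases =
        new Dictionary<string, Intention>
        {
            { "Lily, what time is it?", new Intention { context = Context.General, general = General.CurrentTime } },
            { "What time is it?",       new Intention { context = Context.General, general = General.CurrentTime } },
            { "Do you know the time?",  new Intention { context = Context.General, general = General.CurrentTime } },
            { "Lily?",                  new Intention { context = Context.General, general = General.Pulse } },
            { "Shut down",              new Intention { context = Context.General, general = General.ShutDown } },
        };

    // Throws if a phrase dictionary is too large to hand to the recognizer.
    public static void EnsureWithinLimit(Dictionary<string, Intention> phrases)
    {
        if (phrases.Count > MaxEntries)
            throw new InvalidOperationException(
                "Phrase dictionary has " + phrases.Count + " entries; the limit is " + MaxEntries + ".");
    }

    // Counts how many spoken phrases lead to each General action.
    public static Dictionary<General, int> PhrasesPerAction(Dictionary<string, Intention> phrases)
    {
        return phrases.Values
                      .GroupBy(i => i.general)
                      .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Running a check like EnsureWithinLimit once at startup catches an oversized vocabulary before any grammar is built from it.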

Now that the vocabularies are mapped to Intentions, I can get to the fun part. First, the system needs to get a handle to the SpeechRecognitionEngine:

ri = SpeechRecognitionEngine.InstalledRecognizers()
    .FirstOrDefault(r => r.Id == RecognizerId);
if (ri == null)
{
  // No RecognizerInfo => bail
  return;
}
sre = new SpeechRecognitionEngine(ri.Id);

Now, I’ll take the phrases I developed earlier and turn them into Choices:

// Build our categories of Choices
var contextPhrases = new Choices();
foreach (var phrase in ContextPhrases)
  contextPhrases.Add(phrase.Key);

I’ll do this same operation for all of the phrases I created earlier. The Choices are then passed into the Append method on a GrammarBuilder. The last thing is to create the Grammar objects and load them into the SRE. To accomplish this, I simply create a new Grammar and pass in the GrammarBuilder representing the desired grammar, as shown in Figure 5.

Figure 5 Creating Grammars and Loading the Speech Recognition Engine

// And finally create our Grammars
gContext = new Grammar(gbContext);
gGeneral = new Grammar(gbGeneral);
gResearch = new Grammar(gbResearch);
gOffice = new Grammar(gbOffice);
gOperations = new Grammar(gbOperations);
gMeeting = new Grammar(gbMeeting);
gTranslation = new Grammar(gbTranslation);
gEntertain = new Grammar(gbEntertain);
// We're only going to load the Context and General grammars at this point
sre.LoadGrammar(gContext);
sre.LoadGrammar(gGeneral);
allDicts = new List<Dictionary<string, 
    Intention>>() { ContextPhrases,
                    GeneralPhrases,
                    ResearchPhrases,
                    OfficePhrases,
                    OperationsPhrases,
                    MeetingPhrases,
                    TranslationPhrases,
                    EntertainPhrases };

Notice that only two Grammars, gContext and gGeneral, are loaded into the SRE, but all of the phrase dictionaries are added to a List. This is how I effect the context-aware portion of listening. General and context phrases need to be continually present in the SRE because they might be spoken at any time, without any predetermined pattern. However, one additional grammar will be loaded when a context phrase is identified. To complete this part of the application, I simply handle the SpeechRecognized event on the SRE. The SpeechRecognizedEventArgs argument is used to evaluate what was spoken. Referring back to Figure 4, if the recognized phrase maps to an Intention that’s flagged as contextOnly, then the system needs to unload the third Grammar (if there is one) from the SRE and load the newly identified Grammar. This enables the application to listen to different vocabularies and lexicons dependent on the context that’s currently in scope.
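The swap itself can be sketched without the speech engine. In the following minimal example, GrammarHost is a hypothetical stand-in that only records which grammars are loaded; in the real application those calls would be sre.LoadGrammar and sre.UnloadGrammar with the corresponding Grammar objects. The GrammarId enum and the ContextSwitcher class are likewise assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public enum GrammarId { Context, General, Research, Office, Operations }

// Stand-in for the recognizer: just tracks the loaded grammars.
public class GrammarHost
{
    private readonly HashSet<GrammarId> loaded = new HashSet<GrammarId>();

    public int Count { get { return loaded.Count; } }
    public bool IsLoaded(GrammarId id) { return loaded.Contains(id); }
    public void Load(GrammarId id) { loaded.Add(id); }
    public void Unload(GrammarId id) { loaded.Remove(id); }
}

public class ContextSwitcher
{
    private readonly GrammarHost host;
    private GrammarId? active; // the swappable third grammar, if any

    public ContextSwitcher(GrammarHost host)
    {
        this.host = host;
        // Context and General stay loaded for the whole session.
        host.Load(GrammarId.Context);
        host.Load(GrammarId.General);
    }

    public GrammarId? Active { get { return active; } }

    // Called when a contextOnly Intention is recognized.
    public void SwitchTo(GrammarId next)
    {
        // The two permanent grammars are never swapped in or out.
        if (next == GrammarId.Context || next == GrammarId.General) return;
        if (active == next) return;

        if (active.HasValue) host.Unload(active.Value);
        host.Load(next);
        active = next;
    }
}
```

Because the previous context grammar is always unloaded before the new one is loaded, the host never holds more than three grammars at a time.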

The dictionary (see Figure 4) has a key of type string that represents the spoken phrase and a value of type Intention that indicates what actions the system should take to satisfy the user’s needs. Each entry added to the dictionary initializes an Intention, which typically comprises three components: the context association; whether this is a context-changing event; and (for non-context-changing events) what action is intended.

With all of the supported phrase dictionaries defined, I add this information to the SRE provided by the Kinect unit, as illustrated in Figure 6.

Figure 6 Speech Recognition Engine

This effectively tells the SRE what to listen for and how to tag the information when it’s heard. However, in an effort to make the system more intelligent and usable, I’ve limited the system to having only three contexts and their respective phrase dictionaries loaded in the SRE at any given time. The Context and General phrases are always loaded due to their fundamental nature. The third loaded context and its phrases are determined by interaction with the end user. As Lily listens to the environment, “she” reacts to key words and phrases and removes one set of phrases to replace it with another in the SRE. An example will help clarify this.

How It Works

When Lily first starts up, ContextPhrases and GeneralPhrases are loaded into the SRE. This enables the system to hear commands that either change the system’s context or trigger general actions. For example, after initialization is complete, should the user ask, “What time is it?” Lily “understands” this (it’s part of the GeneralPhrases) and responds with the current time. Similarly, if the user says, “I need some information,” Lily understands that this is a flag to load the ResearchPhrases into the SRE and begin listening for intentions mapped to the Research context. This enables Lily to achieve three important goals:

Maximize Lily’s performance by listening only for the minimum set of phrases that could be relevant.
Enable the use of language that would otherwise be ambiguous because it carries different meanings in different contexts (recall the “cool” example) by allowing only the lexicon of the specified context.
Enable the system to listen for several differing phrases but map them to the same action (for example, “Lily, what time is it?” “What time is it?” and “Do you know the time?” can all be mapped to the same action of telling the user the time). This potentially enables a rich lexicon for context-specific dialogue with the system. Instead of forcing a user to memorize keywords or phrases in a one-to-one fashion, mapping intended actions to a command, the designer can model several different common ways to indicate the same thing. This affords the user the flexibility to say things in normal, relaxed ways. One of the goals of ubiquitous computing is to have the devices fade into the background. Creating context-aware dialogue systems such as Lily helps remove the user’s conscious awareness of the computer; it becomes an assistant, not an application.

Armed with all the requisite knowledge, Lily can now listen and respond with context-appropriate actions. The only thing left to do is instantiate the KinectAudioSource and specify its parameters. In the case of Project Lily, I put all of the audio processing within the SpeechTracker class. I then call the BeginListening method on a new thread, separating it from the UI thread. Figure 7 shows that method.

Figure 7 KinectAudioSource

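private void BeginListening()
{
  // Configure the Kinect microphone array as the audio source
  kinectSource = new KinectAudioSource();
  kinectSource.SystemMode = SystemMode.OptibeamArrayOnly;
  kinectSource.FeatureMode = true;
  kinectSource.AutomaticGainControl = true;
  kinectSource.MicArrayMode = MicArrayMode.MicArrayAdaptiveBeam;
  var kinectStream = kinectSource.Start();
  // Feed the stream to the SRE as 16 kHz, 16-bit, mono PCM
  sre.SetInputToAudioStream(kinectStream,
       new SpeechAudioFormatInfo(EncodingFormat.Pcm,
       16000,   // samples per second
       16,      // bits per sample
       1,       // channel count
       32000,   // average bytes per second
       2,       // block alignment
       null));  // no format-specific data
  // Keep recognizing until explicitly stopped
  sre.RecognizeAsync(RecognizeMode.Multiple);
}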

Several parameters can be set, depending on the application being built. For details on these options, check out the documentation in the Kinect for Windows SDK Programming Guide. Then I simply have my WPF application subscribe to the SpeechTracker SpeechDetected event, which is essentially a pass-through of the SRE SpeechRecognized event, but with the Intention included in the event arguments. If the SRE finds a match for any of the phrases it has loaded, it raises the SpeechRecognized event. The SpeechTracker handles that event and evaluates whether the Intention indicates a changing of context. If it does, the SpeechTracker handles unloading and loading the Grammars appropriately and raises the SpeechContextChanged event. Otherwise, it raises the SpeechDetected event and allows anything monitoring that event to handle it.
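The dispatch logic described here can be sketched in plain C#. The event names SpeechDetected and SpeechContextChanged come from this article; the SpeechTrackerSketch class, the IntentionEventArgs type and the HandleRecognized method are hypothetical, and the speech engine is left out so the flow is easy to follow. In the real application, HandleRecognized would be driven by the SRE’s SpeechRecognized handler using e.Result.Text.

```csharp
using System;
using System.Collections.Generic;

public enum Context { None, General, Research, Office }

public struct Intention
{
    public Context context;
    public bool contextOnly;
}

// Hypothetical event payload carrying the resolved Intention.
public class IntentionEventArgs : EventArgs
{
    public IntentionEventArgs(string phrase, Intention intention)
    {
        Phrase = phrase;
        Intention = intention;
    }

    public string Phrase { get; private set; }
    public Intention Intention { get; private set; }
}

public class SpeechTrackerSketch
{
    private readonly List<Dictionary<string, Intention>> allDicts;

    public event EventHandler<IntentionEventArgs> SpeechDetected;
    public event EventHandler<IntentionEventArgs> SpeechContextChanged;

    public SpeechTrackerSketch(List<Dictionary<string, Intention>> allDicts)
    {
        this.allDicts = allDicts;
    }

    // Looks the phrase up in every dictionary and raises the matching event.
    public void HandleRecognized(string phrase)
    {
        foreach (var dict in allDicts)
        {
            Intention intention;
            if (!dict.TryGetValue(phrase, out intention)) continue;

            // A context change would also unload/load grammars at this point.
            var handler = intention.contextOnly ? SpeechContextChanged : SpeechDetected;
            if (handler != null) handler(this, new IntentionEventArgs(phrase, intention));
            return;
        }
        // Unknown phrases are silently ignored.
    }
}
```

A UI layer would subscribe to both events and never touch the recognizer directly, which keeps the voice processing isolated from the WPF code.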

The Confidence Property

One thing to note: An example I found online stated that the Confidence property exposed through the SpeechRecognizedEventArgs (on its Result) is unreliable and shouldn’t be used (contrary to the SDK documentation). I found that when I didn’t use it, the SpeechRecognized event fired almost continually, even when nothing was being spoken. Therefore, in my SpeechRecognized event handler, the very first thing I do is check the Confidence. If it isn’t at least 95 percent, I ignore the result. (The number 95 came simply from trial and error; I can’t claim to have derived it analytically. It just gave me the level of results I was looking for. In fact, the SDK advises testing and evaluating this value on a case-by-case basis.) Once I did this, the false positives dropped to zero. So I would advise carefully testing any statements and solutions you might find online. My experience was that the SDK documentation, as well as the samples that Microsoft provided at the Kinect for Windows site, was immensely valuable and accurate.
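The check itself is tiny. Here’s a minimal sketch, assuming the confidence arrives as a float between 0 and 1 (as e.Result.Confidence does in the speech API); the ConfidenceGate class and its method are hypothetical names, and 0.95 is simply the value that worked in my testing, not a recommended constant.

```csharp
using System;

public static class ConfidenceGate
{
    // Threshold found by trial and error; tune it per application.
    public const float DefaultThreshold = 0.95f;

    // Returns true when a recognition result is confident enough to act on.
    public static bool ShouldAccept(float confidence, float threshold = DefaultThreshold)
    {
        return confidence >= threshold;
    }
}
```

Keeping the threshold as a parameter makes it easy to experiment with different values for different environments or microphones.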

I’m frequently asked: how much speech recognition training does Kinect require? In my experience, none. Once I set the Confidence value, Kinect worked perfectly without any training, tuning and so on. I used myself as the main test subject but also included my 7- and 8-year-old daughters (thanks Kenzy and Shahrazad!), and they were thrilled to find they could tell daddy’s computer what to do and it understood and acted. Very empowering to them; very rewarding for daddy!

Creating the Illusion of a Human Assistant

Being able to switch contexts based on what the system can observe of the environment creates a rich interaction between the user (human) and Lily (machine). To really enhance the illusion of having an assistant, I added a lot of little features. For example, when we human beings talk to one another, we don’t necessarily repeat the same phrases over and over. Therefore, when the Intention of “pulse” is handled, the system picks out a random phrase to speak back to the user. In other words, when the user queries, “Lily?” Lily will randomly respond, “Yes,” “I am here,” “What can I do for you?” or a few other phrases.

I took this one step further. Some of the phrases include a holder for the user’s name or gender-specific pronoun (sir or ma’am). If one of these phrases is randomly selected, then it’s randomly determined whether the name or pronoun is used. This creates dialogue that’s never quite the same. Little details such as this seem trivial and hardly worth the effort, but when you look at the response the human user experiences while interacting with the system, you come to realize that the uniqueness of our communications comes in part from these little details. I found myself thinking of Lily differently from other programs. As I was testing and debugging, I would need to look up some specification or some piece of code. So as I began searching I would continue talking to Lily and instruct “her” to shut down and so on. After a while, I found that I missed that level of “human interaction” when Lily was down. I think this is the best testimony I can give that Kinect brings a whole new era of natural UI.
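The randomized responses can be sketched as follows. The first three phrases are the ones quoted above; the remaining templates, the "{0}" placeholder convention and the PulseResponder class are assumptions made for this example rather than Lily’s actual code.

```csharp
using System;

public class PulseResponder
{
    // "{0}" marks where a form of address may be inserted (assumed convention).
    private static readonly string[] Responses =
    {
        "Yes",
        "I am here",
        "What can I do for you?",
        "Yes, {0}?",
        "How can I help, {0}?"
    };

    private readonly Random random;
    private readonly string userName;
    private readonly string pronoun;

    public PulseResponder(string userName, string pronoun, Random random)
    {
        this.userName = userName;
        this.pronoun = pronoun;
        this.random = random;
    }

    // Picks a random response; if it has a placeholder, randomly fills it
    // with either the user's name or the pronoun.
    public string Next()
    {
        string template = Responses[random.Next(Responses.Length)];
        if (!template.Contains("{0}")) return template;

        string address = random.Next(2) == 0 ? userName : pronoun;
        return string.Format(template, address);
    }
}
```

Passing the Random instance in through the constructor makes the behavior repeatable in tests while still varying naturally at run time.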

To review, Figure 8 shows the progression of the objects needed to create a system that listens and acts on a user’s verbal commands.

Figure 8 Creating a System that Listens and Acts on Verbal Commands

In the next article, I’ll examine leveraging the Kinect depth- and skeleton-tracking capabilities and coupling them with the speech-recognition capabilities to achieve multimodal communication. The importance of context will become even more obvious within the multimodal communication component. You’ll see how to link motions of the body to audio commands and have the system evaluate the entirety of the human communication spectrum.

Leland Holmquest is an employee at Microsoft. Previously he worked for the Naval Surface Warfare Center Dahlgren. He’s working on his Ph.D. in Information Technology at George Mason University.

Thanks to the following technical expert for reviewing this article: Russ Williams