December 2017

Volume 32 Number 12

[Mixed Reality]

Using Cognitive Services in Mixed Reality

By Tim Kulp

Mixed reality (MR) has a lot of potential for enhancing the way users interact with their environment. You can build immersive worlds for them to explore, or you can extend the real world with digital objects. Whether creating next-generation games, education tools or business apps, MR lets you build new ways of interacting with the digital world. In this article I’m going to show you how to take your MR app further into the real world—away from onscreen buttons and to a conversation-based interface—by using Microsoft Cognitive Services.

Cognitive Services is a powerful complement to MR. One of the limits of MR is its comprehension of the world. Tools like Spatial Mapping provide the MR system with an understanding of surfaces and collision points (where your digital content can rest but not pass through). Cognitive Services allows you to turn surfaces into tables, chairs, flooring and more. In other words, Cognitive Services lets MR understand that a particular surface is not just a surface, but a wooden table. Beyond understanding the environment, Cognitive Services enables MR to further reflect the physical world by using speech for interactions with digital objects, and custom gestures to tailor how people think about interacting with objects.

For this article, I’m going to build an MR app using Unity, C# and Vuforia. You’ll want to start with Unity Beta 2017.3.0b3, which has Vuforia as an installation option. Make sure to install the Vuforia components for this version of Unity. To use Vuforia, you’ll need to sign up for the Developer Portal and set up your targets there. Details about how to set up your Vuforia account can be found at bit.ly/2yrE6yH. Keep in mind that because you’ll be using Unity, you’ll be working with the Microsoft .NET Framework 3.5.

What Am I Building?

I’ll be building an app for Contoso Power, an energy utility that is using MR for field services. The application will provide answers to technicians in the field based on the equipment the technician scans for the MR experience. Imagine that each piece of equipment has unique repair instructions. When the technician scans the MR marker on the equipment, the system identifies the question/answer set to use. The technician can then verbally activate the app’s “listening mode,” ask the question (again verbally) and get an audio response from the app. When the technician is finished, they can end the call. I’m going to use MR to load a digital model (sometimes called the digital twin) of an object and then add a conversational interface to allow the technician to learn more about how to service the equipment.

When you download the sample code from bit.ly/2gF8LhP, you’ll get all the digital assets for this project. There are two separate Unity projects, one called Start and one called Complete. Start is an empty solution with only some digital assets to get you going. Complete contains all the scripts and digital assets for reference. I’ll be working from Start but I’m also providing Complete for clarification of a few items.

Building the First MR Scene

Target-based MR scenes—also known as marker-based augmented reality (AR)—have three components:

  • ARCamera: This is the camera of the device viewing the MR scene.
  • Target: This is the real-world object that triggers the digital content to appear.
  • Stage: This is the world that the MR elements exist within.

When you open the Start project, you’ll find only a Main Camera and Directional Light showing. Delete the Main Camera and click on Game Object | Vuforia | AR Camera. This will add the basic AR Camera to your scene. Click on the AR Camera and then on Open Vuforia Configuration. This is where you need to add the App License Key you created from the Vuforia Developer Portal, as shown in Figure 1. If you haven’t done that already, click the Add License button to go through the tutorial. Unity 2017.3 is connected to the Vuforia developer workflow, which allows you to easily jump from Unity to the Vuforia Developer Portal.

Figure 1 The Vuforia Configuration Inspector

Once your key is loaded, you can add targets. First, however, you need to have a target database for your MR app. Click the Add Database button (still in the Vuforia Configuration inspector). This opens the Vuforia portal where you can create a new target database—a collection of image or object targets that will trigger your MR content to appear. Create a new database, add some image targets and then download the Unity package of the database. Adding this package to your Unity project imports your image targets. One important note: image targets need to have a lot of tracking points to provide the best MR target scanning experience. I’ve included some sample targets in the Assets/images folder of the Start Unity project. Feel free to use these images for your image targets. Once you’ve downloaded and installed the Unity package for your target database, click on the checkbox (in Vuforia Configuration) by the name of the target database you just loaded. This loads the target database into Unity for your app to use. Finally, click the Activate checkbox to enable the target database to be available on app start.

Before you leave the Vuforia Configuration view, there’s one more setting to check. Depending on your computer’s camera, you might need to change the WebCam settings at the bottom of the screen. I’m developing on a Surface Pro 4, and its cameras aren’t recognized by default. If you have a similar issue, open {Project Root}\Assets\Editor\QCAR\WebcamProfiles\profiles.xml and add the following XML:

<webcam deviceName="Microsoft Camera Rear">
  <windows>
    <requestedTextureWidth>640</requestedTextureWidth>
    <requestedTextureHeight>480</requestedTextureHeight>
    <resampledTextureWidth>640</resampledTextureWidth>
  </windows>
</webcam>

This will register the rear camera for the Surface Pro 4. If you want the front camera, just change the deviceName to Microsoft Camera Front.

Now the app is ready for an image target. Go to Game Object | Vuforia | Image. This will create an image target in your scene. In the Inspector | Image Target Behavior window, set Database to the target database you just loaded. Then select the image target you want to use from your database. The surface of the image target will now look like the image target selected.

To complete your MR scene, drag a 3D model from the Prefabs folder to be a child of the image target. When the AR Camera (which is the device’s camera at run time) sees the image target, the child 3D model will appear in the world over the image target. At this point, your MR app is ready. Make sure you have a printed copy of the image target you selected and give the app a run.

Getting Conversational

Now that the app can see an image target and generate a digital object, let’s create an interaction. I’m going to keep the interaction very simple and leverage a mix of the Windows Speech libraries that run on the device and the Speech API provided with Microsoft Cognitive Services. The questions the technician can ask, along with the answers, will be stored in the QnA Maker Service, one of the knowledge management offerings in Microsoft Cognitive Services. QnA Maker is a service that allows you to set up a question/answer pair using an existing FAQ, a formatted text file or even using the QnA input form through the service. Using your existing FAQ URLs or a text file exported from an existing knowledge management system lets you get up and running with the QnA Maker Service quickly and easily. Learn more about QnA Maker at qnamaker.ai.

Here’s the flow: The user activates listening mode using a keyphrase; the user asks a question, which is compared against the QnA Maker Service to find an answer; the answer is returned as text, which is then sent to the Speech API for transformation into an audio file; the audio is returned and played to the user; the user deactivates listening mode using a keyphrase.
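
The flow above can be read as a small two-state machine: the app either waits for the start keyphrase or listens for questions until it hears the end keyphrase. The following is a minimal sketch of that idea in plain C#, not code from the sample project. ConversationFlow and ConversationAction are hypothetical names used only for illustration, and the phrases are passed in rather than hardcoded:

```csharp
using System;

// Hypothetical names for illustration; not part of the sample project.
public enum ConversationAction { Ignore, StartListening, AskQuestion, StopListening }

public class ConversationFlow {
  private readonly string startPhrase;
  private readonly string endPhrase;
  private bool listening;

  public ConversationFlow(string startPhrase, string endPhrase) {
    this.startPhrase = startPhrase;
    this.endPhrase = endPhrase;
  }

  // Decides what the app should do with a phrase it has just heard.
  public ConversationAction Handle(string heard) {
    if (!listening) {
      // Outside listening mode only the start keyphrase has any effect.
      if (Same(heard, startPhrase)) {
        listening = true;
        return ConversationAction.StartListening;
      }
      return ConversationAction.Ignore;
    }
    // In listening mode the end keyphrase hangs up; anything else is a question.
    if (Same(heard, endPhrase)) {
      listening = false;
      return ConversationAction.StopListening;
    }
    return ConversationAction.AskQuestion;
  }

  private static bool Same(string a, string b) {
    return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
  }
}
```

In the article's design these decisions are split across two components (a keyword recognizer and a dictation recognizer) rather than held in one class; the sketch only makes the transitions explicit.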

To keep this simple, I won’t be using Language Understanding Intelligent Service (LUIS), but as you explore this solution, I encourage you to leverage LUIS to create a more natural conversation for users.

To start implementing the conversation components, I always create a Managers object to hold the various components needed to manage the scene overall. Create an empty Game Object called Managers and place it on the root of the scene.

Users of this app will start the conversation by saying “Hello Central.” This short statement is ideal for the KeywordRecognizer object. When the app hears Hello Central, it’s akin to saying “Hey Cortana”—it triggers the app to listen for more commands.

First, I’ll create the KeywordRecognizer to listen for the key words: Hello Central. To do this, I’ll create a new C# script in the Scripts folder called KeywordRecognizer, then, within the class, I’ll add a new struct called KeywordAndResponse:

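using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Events;
using UnityEngine.Windows.Speech;

public class KeywordRecognizer : MonoBehaviour {
  // Pairs a spoken phrase with the UnityEvent to raise when it is heard.
  [System.Serializable]
  public struct KeywordAndResponse {
    public string Keyword;
    public UnityEvent Response;
  }
  public KeywordAndResponse[] Keywords;
  public bool StartAutomatically;
  private UnityEngine.Windows.Speech.KeywordRecognizer recognizer;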

Why Not Use HoloToolkit?

HoloToolkit is an excellent jumpstart to your HoloLens project. KeywordRecognizer already exists there and is much more robust than my sample, so why not use the HoloToolkit in this article? The reason is that I’m not building a HoloLens-only system. Instead, I’m setting the groundwork for you to build an app for the Surface Pro that can work on HoloLens but isn’t constrained to it. Not every company is ready to take on the cost of a HoloLens for their fleet. The concepts presented here allow you to build for the Samsung GearVR, Google ARCore and even the Apple ARKit. If you’re familiar with the HoloToolkit, you’ll notice that I’ve borrowed some of the implementation because, as I said, it’s very good. However, I’ve kept the code lighter and more concrete to work for the current scenario.

Next, I’ll implement the Start method to load the keywords and responses that will be provided through the Unity Editor interface to the app:

void Start () {
  List<string> keywords = new List<string>();
  foreach(var keyword in Keywords)
    keywords.Add(keyword.Keyword);
  this.recognizer =
    new UnityEngine.Windows.Speech.KeywordRecognizer(keywords.ToArray());
  this.recognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;
  this.recognizer.Start();
}

In the Start method, I loop through each of the keywords to get a list of words. Then I register these words with the KeywordRecognizer, connect the OnPhraseRecognized event to a handler and finally start the recognizer. The OnPhraseRecognized method is called any time a phrase in the keywords list is heard by the app:

private void KeywordRecognizer_OnPhraseRecognized(
  PhraseRecognizedEventArgs args) {
  foreach (var keyword in Keywords) {
    if (keyword.Keyword == args.text) {
      keyword.Response.Invoke();
      return;
    }
  }
}

This handler loops through the keyword list to find the matching response, then calls the Invoke method to trigger the UnityEvent. By configuring the KeywordAndResponse object as I did, I can use the Unity Editor to configure the KeywordRecognizer. As stated before, this is a lite implementation of the more robust HoloToolkit KeywordRecognizer code.
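
The handler compares strings with ==, which is case-sensitive, and scans the whole array on each recognition. One possible variation is a dictionary keyed on the phrase with a case-insensitive comparer. This is a sketch in plain C# under that assumption; KeywordDispatcher is a hypothetical helper and is not part of the sample project or the HoloToolkit, and it uses Action delegates where the component above uses UnityEvent:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps recognized phrases to callbacks, ignoring case.
public class KeywordDispatcher {
  private readonly Dictionary<string, Action> responses =
    new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase);

  // Registers (or replaces) the callback for a phrase.
  public void Register(string keyword, Action response) {
    responses[keyword] = response;
  }

  // Invokes the callback for the phrase; returns false if none is registered.
  public bool Dispatch(string recognizedText) {
    Action response;
    if (recognizedText != null &&
        responses.TryGetValue(recognizedText, out response)) {
      response();
      return true;
    }
    return false;
  }
}
```

For a handful of keyphrases the loop in the article is perfectly adequate; the dictionary mainly helps when the keyword list grows or the casing of the recognized text cannot be relied on.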

Once KeywordRecognizer hears Hello Central, the app needs to listen for the next command. One challenge in building a knowledge management app like this is that knowledge can change. To make that change easy to capture and communicate across any platform, I’ll use the QnA Maker Service to build a question-and-answer format for knowledge management. The QnA Maker Service is still in preview, but documentation can be found at qnamaker.ai.

I’m going to build another C# script component called QandAManager (create this script in your Scripts folder) to interact with the QnA Maker Service. Note that in this code I’ll be using the UnityEngine.Windows.Speech namespace, which is available only in Windows. If you want to do this for Android or iOS, you need to implement the Speech to Text API from Microsoft Cognitive Services, which can be found at bit.ly/2hKYoJh.

I’ll start the QandAManager with the following classes:

public class QandAManager : MonoBehaviour {
  [Serializable]
  public struct Question {
    public string question;
  }
  [Serializable]
  public struct Answer {
    public string answer;
    public double score;
  }

These structs will be used to serialize the questions and answers exchanged with the QnA Maker Service. Next, I’ll add some properties:

public string KnowledgeBaseId;
public string SubscriptionKey;
public AudioSource Listening;
public AudioSource ListeningEnd;
public string EndCommandText;
private DictationRecognizer recognizer;
private bool isListening;
private string currentQuestion;

KnowledgeBaseId will be used to determine which knowledge base to load, depending on which image target is being viewed. SubscriptionKey is used to identify the app to the QnA Maker Service. Listening and ListeningEnd are audio cues to tell the user that the system is listening, or not. EndCommandText is the “hang up” command, in this case, “Thank You Central.” The rest I’ll discuss as they’re implemented.

StartListening is the public method that triggers the QandAManager to start listening:

public void StartListening() {
  if (!isListening) {
    PhraseRecognitionSystem.Shutdown();
    recognizer = new DictationRecognizer();
    recognizer.DictationError += Recognizer_DictationError;
    recognizer.DictationComplete += Recognizer_DictationComplete;
    recognizer.DictationResult += Recognizer_DictationResult;
    recognizer.Start();
    isListening = true;
    Listening.Play(0);
  }
}

This turns on DictationRecognizer, connects the events and plays the sound to indicate the app is now listening. Only one recognizer can be running at a given time, which means KeywordRecognizer needs to be shut down before the DictationRecognizer can start. PhraseRecognitionSystem.Shutdown stops all KeywordRecognizers, as shown in Figure 2.

Figure 2 Stopping All KeywordRecognizers

private void Recognizer_DictationResult(
  string text, ConfidenceLevel confidence) {
  if (confidence == ConfidenceLevel.Medium
    || confidence == ConfidenceLevel.High) {
    currentQuestion = text;
    if (currentQuestion.ToLower() == EndCommandText.ToLower()) {
      StopListening();
    }
    else {
      GetResponse();
    }
  }
}

The Recognizer_DictationResult event takes the text the app hears and triggers the StopListening method if the EndCommandText phrase is recognized:

public void StopListening() {
  recognizer.Stop();
  isListening = false;
  ListeningEnd.Play(0);
  recognizer.DictationError -= Recognizer_DictationError;
  recognizer.DictationComplete -= Recognizer_DictationComplete;
  recognizer.DictationResult -= Recognizer_DictationResult;
  recognizer.Dispose();
  PhraseRecognitionSystem.Restart();
}

Otherwise, it performs GetResponse. For the sake of brevity, I won’t go into the DictationError and DictationComplete events as they don’t directly add to my solution. Please refer to the Complete Unity project to see an implementation.

StopListening shuts down the DictationRecognizer, disconnects all the events and then restarts the KeywordRecognizer. If the user initiates another interaction via Hello Central, the StartListening method will again be triggered and reconnect the DictationRecognizer.

The GetResponse method calls the getResponse coroutine:

private void GetResponse(){
  StartCoroutine(getResponse());
}

A coroutine is a function that can run across frames in Unity (for more information see bit.ly/2xJGT75). In this instance, getResponse is going to reach out to the QnA Maker Service, send a question and get back an answer, as shown in Figure 3. If you’re not a Unity developer, the JsonUtility class might be new to you. JsonUtility is a class built into Unity to serialize C# objects to and from JSON. In Figure 3, I’m using JsonUtility.ToJson to convert the Question object to a JSON representation to send to the QnA Maker Service.

Figure 3 The getResponse Coroutine

private IEnumerator getResponse(){
  string url = "https://westus.api.cognitive.microsoft.com/" +
    "qnamaker/v1.0/knowledgebases/{0}/generateAnswer";
  url = string.Format(url, KnowledgeBaseId);
  Question q = new Question() { question = currentQuestion };
  string questionJson = JsonUtility.ToJson(q);
  byte[] questionBytes =
    System.Text.UTF8Encoding.UTF8.GetBytes(questionJson);
  Dictionary<string, string> headers = new Dictionary<string, string>();
  headers.Add("Ocp-Apim-Subscription-Key", SubscriptionKey);
  headers.Add("Content-Type", "application/json");
  WWW w = new WWW(url, questionBytes, headers);
  yield return w;
  if (w.isDone){
    Answer answer = JsonUtility.FromJson<Answer>(w.text);
    TextToSpeechManager tts = GetComponent<TextToSpeechManager>();
    tts.Say(answer.answer);
  }
}

For brevity I didn’t include checking w.error to handle any error that comes from the QnA Maker Service, but you should make sure to handle errors accordingly.
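
A minimal sketch of what such an error check might look like, assuming the same Unity WWW class used in Figure 3. QnAErrorCheckExample and SendAndLog are hypothetical names for illustration and are not taken from the Complete project; the sketch only logs the result, where the real coroutine would go on to parse the answer:

```csharp
using System.Collections;
using UnityEngine;

// Hypothetical component showing only the error-handling shape.
public class QnAErrorCheckExample : MonoBehaviour {
  // Waits for an already-created request and logs either the error or the body.
  private IEnumerator SendAndLog(WWW w) {
    yield return w;
    // WWW.error is null or empty when the request completed without an error.
    if (!string.IsNullOrEmpty(w.error)) {
      Debug.LogError("QnA Maker request failed: " + w.error);
      yield break;
    }
    Debug.Log("QnA Maker response: " + w.text);
  }
}
```

In a voice-driven app it may also be worth speaking a short fallback message on failure, so the technician is not left waiting in silence.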

In the code in Figure 3, you can see a reference to the TextToSpeechManager, which is the next component I’ll build. As its name suggests, TextToSpeechManager will take the text and turn it into an audio file. Before I do this, though, I’ll first build a component that allows the app to determine which knowledge base ID to use based on which image target is being viewed. 

Identifying the Target

In Vuforia, targets can implement the ITrackableEventHandler to perform custom actions when the target’s recognition status changes or even when the user changes from one target to another. In this app, I’m going to update the QandAManager KnowledgeBaseId property when the target is recognized.

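I’ll create a new C# script in the Scripts folder called TargetTracker. Because its Start method looks up the TrackableBehaviour with GetComponent, this component belongs on the image target itself rather than on the Managers game object. Inside TargetTracker, I’ll add the ITrackableEventHandler interface: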

public class TargetTracker : MonoBehaviour, ITrackableEventHandler {
  private TrackableBehaviour mTrackableBehaviour;
  public QandAManager qnaManager;
  public string KnowledgeBaseId;

The public KnowledgeBaseId value is a string you set to identify which KnowledgeBaseId to use when the target is recognized. The qnaManager object is a reference to the QandAManager component; as a public field, it can be assigned in the Inspector. Now I’ll configure the component in Start to register it with the target’s TrackableBehaviour:

void Start () {
  mTrackableBehaviour = GetComponent<TrackableBehaviour>();
  if (mTrackableBehaviour) {
    mTrackableBehaviour.RegisterTrackableEventHandler(this);
  }
}

Finally, I’ll set the OnTrackableStateChanged event to update the knowledge base Id of the QandAManager with the knowledge base Id for this component:

public void OnTrackableStateChanged(
  TrackableBehaviour.Status previousStatus,
  TrackableBehaviour.Status newStatus) {
  if (newStatus == TrackableBehaviour.Status.DETECTED ||
    newStatus == TrackableBehaviour.Status.TRACKED ||
    newStatus == TrackableBehaviour.Status.EXTENDED_TRACKED) {
    qnaManager.KnowledgeBaseId = KnowledgeBaseId;
  }
}

With the code complete you can add the TargetTracker component to the Image Target in your scene. Next, take the subscription key from the QnA Maker Service and enter it into the QandAManager subscription key. Then take the knowledge base Id provided by the QnA Maker Service and add that to the TargetTracker associated with your scene’s image target. If you have multiple knowledge bases, you just need to set a different knowledge base Id on each target to switch which questions and answers are available for that target.

When setting up the QnA Maker Service, remember to format your questions the way you expect users to ask them. Developers with experience working with users know that predicting user behavior, especially how users will ask a question, is very difficult. To help you build better questions and answers, consider implementing a telemetry tool like Application Insights or another logging mechanism to capture the questions users are asking. This will help you tailor your UX to deliver the best answers to the most common questions.

Talking Back

A conversation goes two ways. One way is the user talking to the app; the other is the app talking to the user. HoloToolkit has a great text-to-speech control, but for this article I’m going to set up and use the Cognitive Services Text to Speech API. Using the Cognitive Services Text to Speech capability allows the app to be more portable across platforms.

Before I start coding the solution, I should note that the Unity classes WWW, WWWForm and UnityWebRequest have a few limitations that create challenges for the code I’m about to write. One such limitation is that to get a token in the Speech API, you must POST an empty body to the token URL. WWW, WWWForm and UnityWebRequest all require you to provide something in the body of a post. But if you provide anything in the body, the Speech API will return a 400 Bad Request error. To solve this, I’m going to be using the C# WebRequest class, which doesn’t support coroutines. To this end, I’ll get the token for the speech API during the Start method of the component.

I’ll create a new C# script in the Scripts folder called TextToSpeechManager and add it to the Managers game object. This component will hold all of my text-to-speech capabilities and expose a public method called Say(string text) to be used by the QandAManager for saying the answer returned from the QnA Maker Service. I’ll start the component with a few public variables:

public class TextToSpeechManager : MonoBehaviour {
  public string SubscriptionKey;
  public string SSMLMarkup;
  public Genders Gender = Genders.Female;
  public AudioSource audioSource;
  private string token;
  private string ssml;

SubscriptionKey stores the key the app will need to get an authentication token from the Speech API. This key is obtained from the Azure Portal by adding the Cognitive Services Speech API. SSMLMarkup is the XML content I’ll use to generate speech from the Speech API. For this article I’ve set up the default value to be:

<speak version='1.0' xml:lang='en-US'>
  <voice xml:lang='en-US'
    xml:gender='{0}'
    name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>{1}</voice>
</speak>

Here, {0} will be replaced with the Gender value and {1} will be the text to send to the Speech API.
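
Because the answer text is substituted into XML, an answer containing characters such as & or < would produce invalid SSML. One way to guard against that is to escape the text before formatting the template. This is a minimal sketch, assuming the .NET SecurityElement.Escape method is available on your target platform; SsmlBuilder is a hypothetical helper and is not part of the sample project:

```csharp
using System;
using System.Security;

// Hypothetical helper: fills the SSML template with XML-escaped text.
public static class SsmlBuilder {
  // template uses {0} for the gender and {1} for the text, as in SSMLMarkup.
  public static string Build(string template, string gender, string text) {
    // Escape replaces <, >, &, " and ' with their XML entities.
    string safeText = SecurityElement.Escape(text ?? string.Empty);
    return string.Format(template, gender, safeText);
  }
}
```

With this helper, a call such as SsmlBuilder.Build(SSMLMarkup, "Female", answerText) could stand in for the string.Format call in Say. The template itself must not contain literal curly braces other than the two placeholders, because string.Format treats them as format items.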

Now I’ll add the Start method:

void Start(){
  ServicePointManager.ServerCertificateValidationCallback =     
    remoteCertificateValidationCallback;
  token = getToken();
}

Start begins by setting up a custom server certificate validation so that Unity can handle the authentication process with the Speech API. The complete code for the validation can be found in the Complete Unity project. Next, I call getToken, which connects to the Speech API authentication service and returns the token to use for future calls to the API:

private string getToken(){
  WebRequest request =
    WebRequest.Create("https://api.cognitive.microsoft.com/sts/v1.0/issueToken");
  request.Headers.Add("Ocp-Apim-Subscription-Key",
    SubscriptionKey);
  request.Method = "POST";
  // The token endpoint expects a POST with an empty body.
  request.ContentLength = 0;
  // Dispose the response and reader once the token has been read.
  using (WebResponse response = request.GetResponse())
  using (StreamReader reader =
    new StreamReader(response.GetResponseStream())) {
    return reader.ReadToEnd();
  }
}

The getToken method is a straightforward connection to the issueToken endpoint, passing in the subscription key and reading the token as a string returned from the API service. However, the token is obtained at start and might time out. In the Complete Unity project, you’ll find token refresh code.
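
Tokens from the issueToken endpoint are short-lived (the service documentation describes a 10-minute lifetime), so a long-running session needs to fetch a new one periodically. The following is a minimal sketch of one way to do that, and is not the refresh code from the Complete project. CachedToken is a hypothetical helper; the fetch delegate stands in for a method like getToken, and the clock is injected so the expiry logic can be exercised without waiting:

```csharp
using System;

// Hypothetical helper: caches a token and re-fetches it once it is too old.
public class CachedToken {
  private readonly Func<string> fetch;     // e.g. a method like getToken
  private readonly TimeSpan lifetime;      // how long a fetched token is reused
  private readonly Func<DateTime> clock;   // injected so tests can control time
  private string token;
  private DateTime fetchedAt;

  public CachedToken(Func<string> fetch, TimeSpan lifetime, Func<DateTime> clock) {
    this.fetch = fetch;
    this.lifetime = lifetime;
    this.clock = clock;
  }

  // Returns the cached token, fetching a new one on first use or after expiry.
  public string Value {
    get {
      DateTime now = clock();
      if (token == null || now - fetchedAt >= lifetime) {
        token = fetch();
        fetchedAt = now;
      }
      return token;
    }
  }
}
```

A possible use would be new CachedToken(getToken, TimeSpan.FromMinutes(9), () => DateTime.UtcNow), reading Value each time the Authorization header is built. Choosing a lifetime slightly shorter than the real one leaves a margin for requests already in flight.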

The Say method is the public method called from the QandAManager. The method has a few different parts, with the basic flow being: prepare the SSML, send it to the Speech API, get the audio file, save the file and then load the file as an audio clip.

public void Say(string text){
  ssml = string.Format(
    SSMLMarkup,
    Enum.GetName(typeof(Genders), Gender),
    text);
  byte[] ssmlBytes =
    System.Text.UTF8Encoding.UTF8.GetBytes(ssml);

This first block of code gets the ssml and prepares it for transmission to the API by converting it to a byte array. Next, I prepare the request to the Speech API:

HttpWebRequest request =
  (HttpWebRequest)WebRequest.Create("https://speech.platform.bing.com/synthesize");
request.Method = "POST";
request.Headers.Add("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");
request.Headers.Add("Authorization", "Bearer " + token);
request.ContentType = "application/ssml+xml";
request.SendChunked = false;
request.ContentLength = ssmlBytes.Length;
request.UserAgent = "ContosoEnergy";

In the headers I add X-Microsoft-OutputFormat and set it to riff-16khz-16bit-mono-pcm so the Speech API will return a .wav file. Unity needs a .wav file to stream the audio clip later in the say method. Notice the request.SendChunked = false statement. This ensures the Transfer Encoding property isn’t set to chunked, which would cause a timeout (408) error when connecting to the Speech API. I also update the UserAgent to be ContosoEnergy because the default Unity value is too long for the Speech API to accept.

With the request and headers prepared, I write the ssmlBytes to the request stream:

Stream postData = request.GetRequestStream();
postData.Write(ssmlBytes, 0, ssmlBytes.Length);
postData.Close();

Next, I get the response and prepare the file path for saving the audio file that came back from the Speech API:

HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// MM = month and mm = minutes; custom format strings are case-sensitive.
string path = string.Format("{0}\\assets\\tmp\\{1}.wav",
  System.IO.Directory.GetCurrentDirectory(),
  DateTime.Now.ToString("yyyy_MM_dd_HH_mm_ss"));

Then I save the audio file and call the say coroutine:

using (Stream fs = File.OpenWrite(path))
using (Stream responseStream = response.GetResponseStream()){
  byte[] buffer = new byte[8192];
  int bytesRead;
  while ((bytesRead =
    responseStream.Read(buffer, 0, buffer.Length)) > 0){
      fs.Write(buffer, 0, bytesRead);
  }
}
StartCoroutine(say(path));
}
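
The file name used in Say is built from a timestamp, and .NET custom date format strings are case-sensitive: MM is the month, mm is the minute, HH is the 24-hour hour and ss is the second. The following is a minimal sketch of building the temporary .wav path with those specifiers, assuming the same assets\tmp folder layout. TempAudioPath is a hypothetical helper, not part of the sample project, and it nests Path.Combine calls because the three-argument overload isn't available in .NET Framework 3.5:

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper: builds a timestamped .wav path under <root>/assets/tmp.
public static class TempAudioPath {
  public static string Build(string root, DateTime timestamp) {
    // InvariantCulture keeps the digits and calendar stable across locales.
    string name = timestamp.ToString(
      "yyyy_MM_dd_HH_mm_ss", CultureInfo.InvariantCulture) + ".wav";
    string folder = Path.Combine(Path.Combine(root, "assets"), "tmp");
    return Path.Combine(folder, name);
  }
}
```

Before writing to the path, calling Directory.CreateDirectory on the folder part ensures it exists (the call does nothing if the folder is already there). Two answers generated within the same second would share a file name, so a counter or GUID suffix may be worth adding if that can happen in your app.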

The last method needed here is the coroutine to read the .wav file and play it as audio. The WWW object has a built-in GetAudioClip function that makes it easy to load and play a file:

private IEnumerator say(string path) {
  WWW w = new WWW(path);
  yield return w;
  if (w.isDone) {
    audioSource.clip = w.GetAudioClip(false, true, AudioType.WAV);
    audioSource.Play();
  }
}

As before, I omitted the w.error check for brevity, but always make sure to handle errors in your code.

Wrapping Up

In this article I built a basic marker-based MR application using Vuforia and Unity. By using image targets, I set up the basic capability to scan an image to generate digital content. I then implemented the Windows KeywordRecognizer and DictationRecognizer components to allow the app to listen to the user. I took the text from the DictationRecognizer and enabled the app to respond to questions via the QnA Maker Service. Finally, I enabled the app to talk back to the user with the Cognitive Services Speech API.

The goal of combining MR and Cognitive Services is to create immersive experiences in which the physical and digital worlds blur into one cohesive interaction. I encourage you to download the code and extend it with LUIS or the Computer Vision APIs to bring the user even further into your app.


Tim Kulp is the director of Emerging Technology at Mind Over Machines in Baltimore, Md. He’s a mixed reality, artificial intelligence and cloud app developer, as well as author, painter, dad, and “wannabe mad scientist maker.” Find him on Twitter: @tim_kulp or via LinkedIn: linkedin.com/in/timkulp.

Thanks to the following technical experts for reviewing this article: Alessandro Del Sole (Microsoft) and Will Gee (BaltiVirtual)
Alessandro Del Sole reviewed the Microsoft Cognitive Services content and Will Gee reviewed the Unity/Vuforia content. Will Gee has been working in real-time 3D graphics for over 20 years, starting in the video game industry at MicroProse Software. Most recently he founded Balti Virtual, a full-service AR/VR software development studio.

