Customized speech vocabularies in Windows Vista

Some of my blog readers have been wondering if they can use customized vocabularies in Windows Vista. Before I answer that, I should explain a little about how a speech recognizer works, so you can appreciate why someone might want customized vocabularies.

A speech recognizer’s job is a difficult one. It has to take in tens of thousands of samples of acoustic data (per second!) from the microphone and figure out what the user said. To map this acoustic data to words, the recognizer needs to know up front what the user might actually say.

For command and control (which maps what the user said into an action, like “File” or “OK”), the phrases that the user might say are specified by the application or, in the case of Windows Vista, by the speech user experience component (SPUX).
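To make the command and control idea concrete, here is a minimal sketch in Python. It is not how SPUX or the Windows Vista recognizer is implemented. The phrase list and action names are made up for illustration, and it matches text rather than audio. It only shows that the set of valid phrases is fixed up front and each one maps to an action.

```python
# Toy command-and-control matcher. The phrases and action names are
# hypothetical; a real recognizer matches audio against a grammar, not text.
COMMANDS = {
    "file": "open_file_menu",
    "ok": "press_ok_button",
    "cancel": "press_cancel_button",
}


def match_command(heard):
    """Return the action for a heard phrase, or None if it is not in the list."""
    return COMMANDS.get(heard.strip().lower())


print(match_command("File"))
print(match_command("OK"))
print(match_command("banana"))
```

Anything outside the list, like “banana”, simply has no action. That is the key difference from dictation, where the recognizer has to choose among a very large set of words.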

For dictation (which maps what the user said into text that is injected into the current document), the phrases aren’t specified up front by the application. Instead, they come from a large dictionary of words that ships with the speech recognizer. In Windows Vista, the English dictionary has over 100,000 words.

100,000 words might seem like a lot, but the dictionary doesn’t contain every word a user might say. Some words haven’t been added yet, such as proper names and words that are specific to certain domains (like legal or medical transcription).

The recognizer also applies another part of the system, called a language model. You can think of a language model as a way to incorporate how frequently a word is used in specific contexts. For example, if you said “r eh d”, you might mean either “red” or “read”. The words around “r eh d” help the recognizer decide which one to use. The recognizer does this for more than just homophones; it uses this context for every word it recognizes in dictation.
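Here is a rough sketch of that idea in Python, assuming a very simple model that only counts pairs of neighbouring words (a bigram model). The training text, function names, and scoring are my own illustration, not the language model that ships in Windows Vista.

```python
from collections import Counter

# Tiny made-up training text; a real language model is trained on far more data.
CORPUS = (
    "i read the book . she read the letter . the red car . "
    "a red door . he read a story ."
)


def train_bigrams(text):
    """Count how often each pair of adjacent words appears in the text."""
    words = text.split()
    return Counter(zip(words, words[1:]))


def pick_word(candidates, previous, following, bigrams):
    """Choose the candidate that fits best between its neighbouring words."""
    def score(word):
        # Add one to each count so an unseen pair does not zero out the score.
        return (bigrams[(previous, word)] + 1) * (bigrams[(word, following)] + 1)
    return max(candidates, key=score)


BIGRAMS = train_bigrams(CORPUS)
print(pick_word(["red", "read"], "the", "car", BIGRAMS))  # prints "red"
print(pick_word(["red", "read"], "i", "the", BIGRAMS))    # prints "read"
```

Both candidates sound the same, so the acoustic data alone cannot separate them. In this sketch the surrounding words decide the result.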

So, when blog readers ask the question, “Will Vista speech recognition be able to support customized vocabularies?”, there are a couple of answers…

1.) Yes, users will be able to add their own words and phrases to the dictionary if they aren’t already there. For example, my boss’s name is John Tippett. “Tippett” is not in the dictionary, so he’ll want to add it. You can add a few hundred, perhaps even a few thousand, words to that dictionary. But you shouldn’t try to add 100,000 medical terms to it. It’s not built for that purpose.

2.) Microsoft doesn’t have any plans to ship custom language models for legal or medical dictation (the two most requested language models) in Windows Vista. The underlying technology for the Windows Vista speech recognizer does allow for multiple language models to be used, however. In fact, we’re using a different language model for the spelling experience in Windows Vista than we do for normal dictation.

“What’s the spelling experience?” you might ask. That’s something I hope to describe more in a future post, but basically it’s the part of the user experience that lets you enter a name by spelling out its letters in a unique way that doesn’t require you to know the military alphabet. I bet you’ll really like it once you see it…