Recognizing answering machines

Recently, I was asked how to do answering machine detection using Microsoft Speech Server 2007 Beta.  The idea is that an app makes an outbound call and needs to know whether it reached an answering machine or a human, because we want to take different steps in each case.  For instance, on reaching a machine we may play a message and then hang up, or we may just hang up and retry.

One of my team members suggested that an HMIHY grammar might help solve this problem.  Supposedly, a researcher had tried this using a list of common answering machine messages and had surprisingly good success.  I tried to obtain the grammar on behalf of the partner but was not able to, because the grammar is MS intellectual property.  So I decided to go about this myself.

For those not familiar with HMIHY grammars, HMIHY stands for "How May I Help You".  The common scenario is a support line, where you state what problem you are having and the app routes you to the correct department.  HMIHY grammars allow for more natural speech utterances and are supported in MSS 2007 Beta through the Conversational Grammar Builder.

So in a nutshell my test app consisted of a single QuestionAnswerActivity with two grammars attached - the HMIHY grammar that recognizes answering machines, and a 'normal' grammar that recognizes different versions of "yes" and "no".  The app then plays a message depending on which of the three results (answering machine, yes, or no) it heard.  In the case of no reco, it replays the prompt.
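The branching described above can be sketched in a few lines of Python.  This is only an illustration of the control flow, not Speech Server code: the result labels, the message texts, and the `next_prompt` function are all made up for the example.

```python
from typing import Optional

# Hypothetical result labels; the real app's names are not shown in this post.
ANSWERING_MACHINE = "answering_machine"
YES = "yes"
NO = "no"


def next_prompt(result: Optional[str], question: str) -> str:
    """Pick what to play next for a single recognition result.

    `result` is None when nothing was recognized (no reco).
    """
    messages = {
        ANSWERING_MACHINE: "Leaving a message for the answering machine.",
        YES: "You said yes.",
        NO: "You said no.",
    }
    if result not in messages:
        # No reco (or an unexpected label): replay the original prompt.
        return question
    return messages[result]
```

Calling `next_prompt(None, question)` returns the question itself, which mirrors the "replay the prompt on no reco" behavior.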

My first big problem was getting the training sentences for the HMIHY grammar.  With HMIHY grammars in Conversational Grammar Builder, you add a number of training sentences indicative of each result.  So I needed a list of training sentences consisting of standard answering machine messages.

I ran a couple of web searches, Google among them, for answering machine messages (interestingly, one gave better results than the other), and I could only find lists of silly answering machine messages.  This was very frustrating, until it occurred to me that in general they were all very similar.  They all contained different versions of "hello, we're not home, leave a message", so maybe with enough of these sentences it would still work?

It took me some time to massage several hundred of these messages into a text file with one response per line and get the recognition engine to accept it.  A number of words (like 'zorbot') were not recognized and had to be removed or replaced.  I could have let the engine use heuristics to determine the pronunciation, but I would not expect most of these words on a real answering machine anyway.  I found it interesting that most profane words were not recognized by the HMIHY engine - I would expect these to be common enough, and not entirely unwarranted, for support desks -  "My <bleep> product won't do what you <bleep> said it would! <bleep>".
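I did this cleanup mostly by hand, but the same steps could be scripted.  Here is a minimal Python sketch, assuming you already have a set of words the engine accepts; `vocabulary`, `replacements`, and `clean_messages` are names invented for this example and are not part of any Speech Server tool.

```python
import re
from typing import Dict, Iterable, List, Set


def clean_messages(
    raw_messages: Iterable[str],
    vocabulary: Set[str],
    replacements: Dict[str, str],
) -> List[str]:
    """Turn raw messages into one-line training sentences.

    Words in `replacements` are swapped first; any word still missing
    from `vocabulary` (an assumed list of accepted words) is dropped.
    """
    cleaned = []
    for message in raw_messages:
        # Lowercase and keep only letters and apostrophes, which also
        # collapses line breaks and punctuation inside a message.
        words = re.findall(r"[a-z']+", message.lower())
        kept = []
        for word in words:
            word = replacements.get(word, word)
            if word in vocabulary:
                kept.append(word)
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned
```

Joining the returned list with newlines gives a text file with one response per line.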

In order to create an HMIHY grammar in CGB, you must create a new response container in the lower tree and add responses indicating the possible choices.  For each response, you add training sentences that would match that response.  If you're still unsure, there is a sample app that demonstrates HMIHY grammars that you can reference.  So I imported my training sentences, built the project in a few minutes with the Dialog Flow Designer, and attached both grammars by clicking the 'add dynamic grammars' link and adding both to the collection.

What I quickly found while debugging the app was that it would always use the HMIHY grammar and never the 'normal' grammar.  I tried changing the weights of the grammars in favor of the 'normal' grammar, still without success.  After talking with others on my team, I found out something very important.

If you have an HMIHY grammar rule attached to a QA, do not add any other grammars to that QA.

The problem is, while trying to 'trick' the engine we end up hurting ourselves.  The correct thing to do is add the results for 'no' and 'yes' to the HMIHY grammar and use only one grammar for the QA.  When I did this I found that my theory was in general correct - I was able to recognize 'normal' messages even though I had only trained with silly ones.  My grammar was able to recognize the following messages as coming from an answering machine:

Hello, you have reached the house of Joe Calev. I am not in right now. Please leave a message after the beep.
Please leave a message after the beep.
Leave a message.
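To give a feel for why one grammar with all three responses can work, and why silly training messages can still match plain ones, here is a toy Python sketch.  It is emphatically not how the HMIHY engine works internally (I don't know the details of that); it is a simple word-overlap scorer, and the training sentences, labels, and function names are all made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(training: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count how often each word appears under each response label."""
    return {
        label: Counter(word for s in sentences for word in tokenize(s))
        for label, sentences in training.items()
    }


def classify(model: Dict[str, Counter], utterance: str) -> Optional[str]:
    """Return the best-matching label, or None if no word matches at all."""
    words = tokenize(utterance)
    best_label, best_score = None, 0.0
    for label, counts in model.items():
        total = sum(counts.values())
        # Share of the label's training words covered by the utterance.
        score = sum(counts[w] for w in words) / total if total else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: 'yes' and 'no' live in the same model as the
# answering machine messages, instead of in a second grammar.
model = train({
    "answering_machine": [
        "hello we're not home leave a message after the beep",
        "you have reached the moon base please leave a message",
    ],
    "yes": ["yes", "yes please", "yeah sure"],
    "no": ["no", "no thanks", "nope"],
})
```

With this toy model, `classify(model, "Please leave a message after the beep.")` picks the answering machine label even though that exact sentence was never in the training data, because the silly messages share most of its words.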

OK, so obviously before something like this could be deployed it would have to be tested in a variety of scenarios, far more rigorously than I did, but I still found it very hard to trick the grammar.

Unfortunately, because my training sentences could be found offensive, I'm not able to post them here, nor am I able to send them to the partner.  But I did demonstrate enough to show that answering machine detection is possible using HMIHY grammars.