Appendix B -- Processing Grammar Returns

Article
03/26/2013

Extracting information from returned strings in a VoiceXML application is accomplished with JavaScript (see http://www.w3schools.com/js/default.asp), primarily using string functions and regular expressions.

When you use a grammar written by someone other than yourself, you have to deal with a match return format that is not of your choosing. Frequently, the match return will be a string that you must manipulate. This appendix shows you how you can use JavaScript to process various types of strings that are returned by grammars that someone else wrote.

Appendix B includes the following sections:

When to use JavaScript
Converting a string to a number
Verifying the number of characters or digits in a returned string
Regular expressions
Extracting a piece of a string
Eliminating prefixes and suffixes with regular expressions

When to use JavaScript

The builtin grammars and Tellme public grammars discussed in Lesson 5 all return match values as strings. Furthermore, data is often concatenated in the string. For example, the date 4/15/1989 might be returned as the string “19890415”.

You will need to extract the information in returned strings for several purposes, including:

Validating that the information is in the correct form (e.g., the correct number of digits).
Generating text that the application can read to the caller for confirmation.
Passing data to a back-end application or database.

Converting a string to a number

Sometimes, you would like an actual integer or floating point number to manipulate, although the match returns the number as a string. As an example, a caller wants to transfer a sum of money between checking and savings accounts. Your transaction program requires a number such as 400.00 that it can process numerically, but the builtin currency grammar you are using returns the string "400.00".

You can use the JavaScript parseInt() and parseFloat() functions to convert strings to numbers, as follows:

x = parseInt(string); where x is an integer

y = parseFloat(string); where y is a floating-point number

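The parseInt() and parseFloat() functions skip any leading whitespace and then read the number from the left, stopping at the first character that cannot be part of a number. For example, parseInt("305C43AQ", 10); returns 305. parseFloat() also accepts a decimal point, so parseFloat("400.00"); returns 400. If the string does not begin with a number (optionally preceded by whitespace or a sign), the parse functions return NaN (not a number). Pass 10 as the second argument to parseInt() so that the string is always read as a decimal number; some older JavaScript engines treat a leading 0 as an octal prefix.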
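
The behavior described above can be sketched as follows. This is a minimal illustration, not part of any grammar's API: the helper name toAmount and the choice to return null for unparseable input are assumptions made for the example.

```javascript
// Hypothetical helper: convert a currency match such as "400.00" to a number.
// Returns null when the string does not start with a number.
function toAmount(matchString) {
   var amount = parseFloat(matchString);
   return isNaN(amount) ? null : amount;
}

var whole = parseInt("305C43AQ", 10);   // 305: parsing stops at "C"
var amount = toAmount("400.00");        // 400
var bad = toAmount("abc");              // null, because parseFloat returns NaN
```

Checking the result with isNaN() before using it lets the application reprompt the caller instead of passing NaN to a back-end program.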

Verifying the number of characters or digits in a returned string

In many cases, you want to verify the number of characters in the returned string. For example, a Social Security Number (SSN) must be nine digits.

Verifying the number of characters

It is easy to verify the number of characters returned in the string SSN with the String length property:

   <script> <![CDATA[
       var numChar = SSN.length;
       if (numChar !== 9) {
          // throw an error causing a reprompt
       }
   ]]> </script>

This code verifies that there are nine characters, but doesn't guarantee that they are digits.

Verifying the number of digits

The parseInt() function can help here. It reads the number from the left and stops at the first non-digit. So, if every character in the string is a digit and we convert the string to an integer with parseInt(), then converting that integer back to a string should yield a string of the same length as the original (nine characters, in the case of an SSN). One caveat: parseInt() drops leading zeros, so this test rejects an otherwise valid string such as "012345678". If leading zeros are possible, test the string against a regular expression such as /^\d{9}$/ instead.

   <script> <![CDATA[
       var intSSN = parseInt(SSN, 10);
       var newSSN = String(intSSN);
       if (newSSN.length !== SSN.length || SSN.length !== 9) {
          // throw an error causing a reprompt
       }
   ]]> </script>
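
A regular expression can check the length and the digits-only requirement in a single step, and it accepts strings that begin with zero. The following is a minimal sketch; the function name isNineDigits is hypothetical and is not part of any grammar or platform API.

```javascript
// Hypothetical helper: true only if s consists of exactly nine digits.
// ^ and $ anchor the pattern to the whole string; \d{9} means nine digits.
function isNineDigits(s) {
   return /^\d{9}$/.test(s);
}

var ok = isNineDigits("123121234");        // true
var leadingZero = isNineDigits("012345678"); // true: leading zeros are kept
var hasLetter = isNineDigits("12345678A");   // false: last character is not a digit
var tooShort = isNineDigits("12345678");     // false: only eight digits
```

Inside a VoiceXML script element, a false result would be the point at which to throw the error that causes a reprompt.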

Regular expressions

Regular expressions are a very powerful (and, at first, very difficult) part of the JavaScript language. You use a regular expression to match patterns of text, which can then be manipulated by your script. See http://www.w3schools.com/jsref/jsref_obj_regexp.asp for a concise introduction to regular expressions.

Using regular expressions to space characters in a string

Grammars frequently return numeric data as a string of digits. For example, an SSN might be returned as a string of nine digits such as ssn = "123121234".

In a typical VoiceXML application, the caller enters the SSN with speech or DTMF and the application uses TTS to repeat it back to the caller for verification. With the above string, you would like the TTS to say “You entered 1 2 3 1 2 1 2 3 4. Is that correct?” However, the TTS will actually say “You entered one hundred and twenty three million, one hundred and twenty one thousand, two hundred and thirty four. Is that correct?”

To get the TTS to say what we want it to, we must alter the string so that it has spaces between the digits, like this: “1 2 3 1 2 1 2 3 4”. This is easily accomplished using a regular expression in the String function replace():

   ssn = ssn.replace(/(.)/g, "$1 ");

Here, we are using replace() to find every character in the string (no matter how many there are) and replace it with itself plus a space.

Explanation of the code:

A regular expression literal begins and ends with a forward slash, so /./ is a regular expression. The dot means “match any character except a newline.” When g is appended to the regular expression, it means to search “globally.” That is, when a match is found, don’t stop; repeat the search on the remaining string until no further match is found.
Why is the dot enclosed in parentheses? Does that alter the search expression? No, the parentheses are there for a different purpose: they form a capturing group, which remembers the text it matched. Inside the replacement string of replace(), $1 stands for the text matched by the first group, $2 for the text matched by the second group, and so forth. In the present case, (.) is the first group and there is no second group. (After a successful match, the same text is also available to a script as RegExp.$1, a property of the RegExp object; that form is used later in this appendix, but it is not expanded inside a replacement string.)
So, this replace() call finds every character (there aren’t any newlines) and replaces it with itself plus a space. In the replacement string, "$1" is the found character and "$1 " is the found character plus a space.
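
The spacing technique can be wrapped in a small function, sketched below. The function name spaceCharacters is hypothetical, and removing the trailing space with a second replace() is an optional extra step that this sketch adds.

```javascript
// Hypothetical helper: put a space between the characters of a string
// so that TTS reads them one at a time.
function spaceCharacters(s) {
   // "$1 " inserts each captured character followed by a space;
   // the second replace() removes the single trailing space.
   return s.replace(/(.)/g, "$1 ").replace(/ $/, "");
}

var spoken = spaceCharacters("123121234");   // "1 2 3 1 2 1 2 3 4"
```

For a string with no newlines, s.split("").join(" ") gives the same result without a regular expression.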

Extracting a piece of a string

You can extract pieces of a string with the JavaScript substring() function or with regular expressions.

Extracting a piece of a string with the substring() function

A number of the date grammars that were discussed in Lesson 5 returned their matches as date = “YYYYMMDD”.

The String method substring(indexA,indexB) extracts the characters in a string starting with the character at indexA (counting from 0) and ending with the character at (indexB - 1).

To extract the year, month, and day from the string date with format “YYYYMMDD”, use this code:

year = date.substring(0,4);
month = date.substring(4, 6);
day = date.substring(6);

Note

If the second argument in substring() is missing, the method extracts the substring from the argument to the end of the string.
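
As an illustration, the substring() calls can be combined with the validation and conversion techniques from earlier in this appendix. This is a sketch under assumptions: the function name splitDate, the decision to return null for malformed input, and the conversion of each piece to a number are all choices made for the example.

```javascript
// Hypothetical helper: split a "YYYYMMDD" match into numeric parts.
// Returns null if the string is not exactly eight digits.
function splitDate(date) {
   if (!/^\d{8}$/.test(date)) {
      return null;
   }
   return {
      year: parseInt(date.substring(0, 4), 10),   // characters 0 through 3
      month: parseInt(date.substring(4, 6), 10),  // characters 4 and 5
      day: parseInt(date.substring(6), 10)        // character 6 to the end
   };
}

var d = splitDate("19890415");   // d.year is 1989, d.month is 4, d.day is 15
```

Passing 10 as the radix ensures that a month such as "04" is read as the decimal number 4.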

Extracting a piece of a string with regular expressions

As noted earlier, the Tellme public grammar for dates returns a string of the form:

"day=Monday^date=15^month=04^year=2010^special_date=last"

Here is a function to extract the pieces:

<script><![CDATA[
// parse match returns of the form
// name1=value1^name2=value2^name3=value3 into a JavaScript object
function parseDate(s) {
   var obj = {};
   var parts = s.split(/\^/);
   for (var i = 0; i < parts.length; i++) {
      if (/^([^=]+)=(.*)$/.test(parts[i])) {
         obj[RegExp.$1] = RegExp.$2;
      }
   }
   return obj;
}  
]]></script>

This script decomposes a string like “name1=value1^name2=value2^name3=value3” into a JavaScript object. Such objects can be treated interchangeably in two different ways:

As an associative array such as object = {name2:value2, name3:value3, name1:value1}. Note that the name/value pairs do not have to be in any particular order.
In dot notation like object.name1=value1, object.name2=value2, etc.

Here’s how the script works:

The object variable is declared in the first line of the function.

The split() method splits a string into an array of shorter strings. It splits on the input parameter. So, we create a new array by splitting on the ‘^’ character (as represented by the regular expression /\^/).

Note

The split parameter could also have been written without a regular expression as split('^').

After the split we have an array named parts = ["name1=value1", "name2=value2", "name3=value3"]. The for loop looks at each element of the parts array in turn, as follows:

The condition in the if statement tests each array element to see if it is of the form xxx=yyy as it should be. The array element should match the regular expression /^([^=]+)=(.*)$/.

The caret (^) at the start of the regular expression anchors the match to the beginning of the string, and the dollar sign ($) at the end anchors it to the end of the string, so the whole array element must match. In between we have:

([^=]+) matches one or more characters, from left to right, as long as they are not =.
= matches =
(.*) matches any number of characters, including zero, as long as they are not carriage returns or line feeds. This means that each of value1, value2, and value3 can either have a value or be an empty string.

/regular expression/.test(x) returns true if x matches the regular expression.

If the match is true, obj[RegExp.$1] = RegExp.$2; assigns a name and value to one of the obj object’s associative array elements (the order of elements doesn’t matter). RegExp.$n refers to the text matched by the nth parenthesized group, counting from the left. So RegExp.$1 is whatever matches ([^=]+) before the = sign, and RegExp.$2 is whatever matches (.*) after the = sign.

After the for loop is completed, we have

obj = {name1:value1, name2:value2, name3:value3}

although it does not have to be in order 1,2,3,…
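
The following sketch shows the same parsing applied to the sample date string, and how the resulting object can be read in both notations. It is a variant written for illustration, not the function above: the name parsePairs is hypothetical, and it reads the captured groups from the array returned by exec() instead of from RegExp.$1 and RegExp.$2.

```javascript
// Hypothetical variant of the parsing function: exec() returns an array
// whose elements 1 and 2 hold the text of the first and second groups,
// or null when the element is not of the form name=value.
function parsePairs(s) {
   var obj = {};
   var parts = s.split("^");
   for (var i = 0; i < parts.length; i++) {
      var m = /^([^=]+)=(.*)$/.exec(parts[i]);
      if (m) {
         obj[m[1]] = m[2];
      }
   }
   return obj;
}

var result = parsePairs("day=Monday^date=15^month=04^year=2010^special_date=last");
var year = result.year;        // "2010", using dot notation
var month = result["month"];   // "04", using associative array notation
```

All of the values are still strings, so a value such as "04" must be converted with parseInt() before it is used in arithmetic.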

Eliminating prefixes and suffixes with regular expressions

A commonly used, commercially available grammar for the names of US cities and states returns its matches as a string with the following format:

xx^new_york@ny:8338

The xx prefix is literal (xx does not stand for something else) and is the same for all city/state combinations. The numeric suffix relates to the city/state database and varies from city to city. When a city name has two or more words, it is returned with an underscore between the words, for example, Santa_Fe, NM or Fond_du_Lac, WI. States are represented by their two-letter USPS abbreviations.

The following short script:

Separates the city and state names.
Eliminates the xx prefix and the numerical suffix.
Removes the underscore in the city name, if any.

<script><![CDATA[
// the field item variable with the grammar match is city_state 
// city_state = xx^new_york@ny:8338

var parts = city_state.split(/@/);
if (parts.length > 1) {
   // get city/state
   city = parts[0].split('^')[1];
   state = parts[1].split(':')[0];
   // get rid of underscore in city, if any
   city = city.replace(/(_)/g, " ");
}
]]></script>

Here is how it works:

First, the city_state string is split on the '@' character. This results in a two-element array, parts, where parts[0] = xx^new_york and parts[1] = ny:8338.
parts[0] is split on the '^' character into a two-element array parts[0].split('^'). The first element is xx and the second element, parts[0].split('^')[1], is the city name that we want.
parts[1] is split on the ':' character into the two-element array parts[1].split(':'). The first element, parts[1].split(':')[0], is ny (the state name that we want) and the second element is 8338 (which we don't want).
Finally, city = city.replace(/(_)/g, " "); globally replaces all underscores in the city name with spaces.

We are left with:

city = "new york" and state = "ny"
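
The same extraction can also be done with a single regular expression and capturing groups. The sketch below is an alternative to the split-based script, not a replacement for it: the function name parseCityState is hypothetical, and the pattern assumes the match always has the form xx^city@state:number shown above.

```javascript
// Hypothetical helper: extract city and state from a match such as
// "xx^new_york@ny:8338". Returns null if the string has another form.
function parseCityState(cityState) {
   // ^xx\^     the literal prefix xx^
   // ([^@]+)   group 1: the city, everything up to the @
   // ([^:]+)   group 2: the state, everything up to the :
   var m = /^xx\^([^@]+)@([^:]+):/.exec(cityState);
   if (!m) {
      return null;
   }
   return {
      city: m[1].replace(/_/g, " "),   // underscores become spaces
      state: m[2]
   };
}

var place = parseCityState("xx^new_york@ny:8338");
// place.city is "new york" and place.state is "ny"
```

Returning null for an unexpected format gives the application one place to detect a bad match before the values are read back to the caller.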