Get started with the speech recognition service library in C# for .NET Windows
The service library is for developers who have their own cloud service and want to call Speech Service from their service. If you want to call the speech recognition service from device-bound apps, do not use this SDK. (Use other client libraries or REST APIs for that.)
The following sections describe how to install, build, and run the C# sample application by using the C# service library.
The following example was developed for Windows 8+ and .NET Framework 4.5+ by using Visual Studio 2015, Community Edition.
Get the sample application
Clone the sample from the Speech C# service library sample repository.
Subscribe to the Speech Recognition API, and get a free trial subscription key
The Speech API is part of Cognitive Services (previously Project Oxford). You can get free trial subscription keys from the Cognitive Services subscription page. After you select the Speech API, select Get API Key to get the key. It returns a primary and secondary key. Both keys are tied to the same quota, so you can use either key.
Get a subscription key. Before you can use the Speech client libraries, you must have a subscription key.
Use your subscription key. With the provided C# service library sample application, you need to provide your subscription key as one of the command-line parameters. For more information, see Run the sample application.
Step 1: Install the sample application
Start Visual Studio 2015, and select File > Open > Project/Solution.
Double-click to open the Visual Studio 2015 Solution (.sln) file named SpeechClient.sln. The solution opens in Visual Studio.
Step 2: Build the sample application
Press Ctrl+Shift+B, or select Build on the ribbon menu. Then select Build Solution.
Step 3: Run the sample application
After the build is finished, press F5 or select Start on the ribbon menu to run the example.
Open the output directory for the sample, for example, SpeechClientSample\bin\Debug. Press Shift+Right-click, and select Open command window here.
Run SpeechClientSample.exe with the following arguments:
- Arg: Specify an input audio WAV file.
- Arg: Specify the audio locale.
- Arg: Specify the recognition mode: Short for the ShortPhrase mode and Long for the LongDictation mode.
- Arg: Specify the subscription key to access the speech recognition service.
ShortPhrase mode: An utterance up to 15 seconds long. As data is sent to the server, the client receives multiple partial results and one final best result.
LongDictation mode: An utterance up to 10 minutes long. As data is sent to the server, the client receives multiple partial results and multiple final results, based on where the server indicates sentence pauses.
Supported audio formats
The Speech API supports audio/WAV by using the following codecs:
- PCM single channel
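Before sending a file, it can help to confirm that it really is single-channel PCM. The following is a minimal sketch, not part of the service library: the WavHeaderCheck class and IsPcmMono method are hypothetical names, and the code assumes a canonical WAV layout in which the "fmt " chunk immediately follows the RIFF header.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavHeaderCheck
{
    // Returns true when the stream starts with a RIFF/WAVE header whose
    // "fmt " chunk declares PCM (format tag 1) and exactly one channel.
    // Assumption: the "fmt " chunk comes right after the 12-byte RIFF header.
    public static bool IsPcmMono(Stream audio)
    {
        using (var reader = new BinaryReader(audio, Encoding.ASCII, leaveOpen: true))
        {
            byte[] header = reader.ReadBytes(24);
            if (header.Length < 24)
            {
                return false; // Too short to hold the fields we need.
            }

            string riff = Encoding.ASCII.GetString(header, 0, 4);
            string wave = Encoding.ASCII.GetString(header, 8, 4);
            string fmt = Encoding.ASCII.GetString(header, 12, 4);
            if (riff != "RIFF" || wave != "WAVE" || fmt != "fmt ")
            {
                return false;
            }

            // WAV fields are little-endian; decode by hand to stay portable.
            int formatTag = header[20] | (header[21] << 8);
            int channels = header[22] | (header[23] << 8);
            return formatTag == 1 && channels == 1;
        }
    }
}
```

The check consumes the first 24 bytes of the stream, so a seekable stream should be rewound (for example, by setting Position back to 0) before it is handed to the SDK.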
To create a SpeechClient, you need to first create a Preferences object. The Preferences object is a set of parameters that configures the behavior of the speech service. It consists of the following fields:
SpeechLanguage: The locale of the audio sent to the speech service.
ServiceUri: The endpoint used to call the speech service.
AuthorizationProvider: An IAuthorizationProvider implementation used to fetch tokens in order to access the speech service. Although the sample provides a Cognitive Services authorization provider, we highly recommend that you create your own implementation to handle token caching.
EnableAudioBuffering: An advanced option. See Connection management.
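The recommendation to cache tokens can be illustrated with a small wrapper. This is a sketch under stated assumptions, not the SDK's own code: ITokenProvider is a hypothetical stand-in for IAuthorizationProvider (its member name and signature are assumed so the example compiles on its own), and the token lifetime is passed in by the caller rather than taken from the service.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK's IAuthorizationProvider.
// The method name and signature are assumptions for this sketch.
public interface ITokenProvider
{
    Task<string> GetAuthorizationTokenAsync();
}

public sealed class CachingTokenProvider : ITokenProvider
{
    private readonly Func<Task<string>> fetchToken;
    private readonly TimeSpan lifetime;
    private readonly Func<DateTimeOffset> clock;
    private readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);
    private string cachedToken;
    private DateTimeOffset expiresAt = DateTimeOffset.MinValue;

    // fetchToken: the call that actually requests a new token.
    // lifetime: how long a fetched token is reused before refreshing.
    // clock: injectable time source; defaults to the system UTC clock.
    public CachingTokenProvider(
        Func<Task<string>> fetchToken,
        TimeSpan lifetime,
        Func<DateTimeOffset> clock = null)
    {
        if (fetchToken == null)
        {
            throw new ArgumentNullException(nameof(fetchToken));
        }

        this.fetchToken = fetchToken;
        this.lifetime = lifetime;
        this.clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<string> GetAuthorizationTokenAsync()
    {
        // Serialize callers so only one of them refreshes an expired token.
        await gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (cachedToken == null || clock() >= expiresAt)
            {
                cachedToken = await fetchToken().ConfigureAwait(false);
                expiresAt = clock() + lifetime;
            }

            return cachedToken;
        }
        finally
        {
            gate.Release();
        }
    }
}
```

Choosing a lifetime somewhat shorter than the real token validity leaves a margin so that a token does not expire in the middle of a request.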
The SpeechInput object consists of two fields:
Audio: A stream implementation of your choice from which the SDK pulls audio. It can be any stream that supports reading.
The SDK detects the end of the stream when a read call on the stream returns 0.
RequestMetadata: Metadata about the speech request. For more information, see the reference.
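The end-of-stream convention for the Audio field is the standard .NET one: Stream.Read returns 0 when no more data is available. The sketch below shows that read loop in isolation; the StreamDrain class and ReadToEnd method are hypothetical helpers for illustration and are not how the SDK is implemented.

```csharp
using System.IO;

public static class StreamDrain
{
    // Reads the stream in fixed-size chunks until Read returns 0,
    // then returns the total number of bytes that were read.
    public static long ReadToEnd(Stream audio, int chunkSize = 4096)
    {
        var buffer = new byte[chunkSize];
        long total = 0;
        int read;

        // A return value of 0 is the only end-of-stream signal;
        // a short read (fewer bytes than requested) is not.
        while ((read = audio.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += read;
        }

        return total;
    }
}
```

A custom stream that wraps a live audio source should therefore block in Read until data arrives, and return 0 only when the audio has genuinely ended.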
After you have instantiated the SpeechClient and SpeechInput objects, use RecognizeAsync to make a request to Speech Service.
var task = speechClient.RecognizeAsync(speechInput);
After the request finishes, the task returned by RecognizeAsync finishes. The last RecognitionResult marks the end of the recognition. The task can fail if the service or the SDK fails unexpectedly.
Speech recognition events
Partial results event
This event is called every time Speech Service predicts what you might be saying, even before you finish speaking (if you use MicrophoneRecognitionClient) or finish sending data (if you use DataRecognitionClient). You can subscribe to the event by using SpeechClient.SubscribeToPartialResult(). Alternatively, you can use the generic events subscription method.
The event returns the following fields:
LexicalForm: This form is optimal for use by applications that need raw, unprocessed speech recognition results.
DisplayText: The recognized phrase with inverse text normalization, capitalization, punctuation, and profanity masking applied. Profanity is masked with asterisks after the initial character, for example, "d***." This form is optimal for use by applications that display the speech recognition results to a user.
Confidence: The level of confidence the recognized phrase represents for the associated audio as defined by the speech recognition server.
MediaTime: The current time relative to the start of the audio stream (in 100-nanosecond units of time).
MediaDuration: The current phrase duration/length relative to the audio segment (in 100-nanosecond units of time).
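Because MediaTime and MediaDuration are expressed in 100-nanosecond units, they map directly onto .NET ticks, which use the same unit. The following sketch shows the conversion; the MediaTimeConversion class is a hypothetical helper, and taking the value as a long is an assumption about how the caller holds it.

```csharp
using System;

public static class MediaTimeConversion
{
    // One .NET tick is 100 nanoseconds, so a value in 100-nanosecond
    // units can be passed to TimeSpan.FromTicks without scaling.
    public static TimeSpan ToTimeSpan(long hundredNanosecondUnits)
    {
        return TimeSpan.FromTicks(hundredNanosecondUnits);
    }
}
```

For example, a MediaTime value of 12,500,000 corresponds to 1.25 seconds from the start of the audio stream.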
When you finish speaking (in ShortPhrase mode), the recognition result event is called. You're provided with n-best choices for the result. In LongDictation mode, the event can be called multiple times, based on where the server indicates sentence pauses. You can subscribe to the event by using SpeechClient.SubscribeToRecognitionResult(). Alternatively, you can use the generic events subscription method.
RecognitionStatus: The status of how the recognition was produced, for example, as a result of successful recognition or as a result of canceling the connection.
Phrases: The set of n-best recognized phrases with the recognition confidence.
For more information on recognition results, see Output format.
Speech recognition response
Speech response example:
--- Partial result received by OnPartialResult
---what
--- Partial result received by OnPartialResult
--what's
--- Partial result received by OnPartialResult
---what's the web
--- Partial result received by OnPartialResult
---what's the weather like
---***** Phrase Recognition Status = [Success]
***What's the weather like? (Confidence:High)
What's the weather like? (Confidence:High)
Connection management
The API uses a single WebSocket connection per request. For an optimal user experience, the SDK attempts to reconnect to Speech Service and resume recognition from the last RecognitionResult that it received. For example, if the audio request is two minutes long, the SDK received a RecognitionEvent at the one-minute mark, and a network failure occurred five seconds later, the SDK opens a new connection that starts from the one-minute mark.
The SDK doesn't seek back to the one-minute mark because the stream might not support seeking. Instead, the SDK keeps an internal buffer that it uses to buffer the audio and clears the buffer as it receives RecognitionResult events.
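The buffering idea described above can be pictured with a small sketch. This is not the SDK's internal implementation: the ReplayBuffer class, its members, and the use of byte offsets to mark how much audio has been acknowledged are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: keeps copies of audio chunks that have been sent
// but are not yet covered by a final recognition result.
public sealed class ReplayBuffer
{
    private readonly Queue<byte[]> chunks = new Queue<byte[]>();

    // Stream offset (in bytes) of the first byte still held in the buffer.
    public long StartOffset { get; private set; }

    public int PendingChunks
    {
        get { return chunks.Count; }
    }

    // Store a private copy of the first `count` bytes of a sent chunk,
    // so that the caller can reuse its own read buffer.
    public void Append(byte[] chunk, int count)
    {
        var copy = new byte[count];
        Buffer.BlockCopy(chunk, 0, copy, 0, count);
        chunks.Enqueue(copy);
    }

    // Drop every chunk that ends at or before the acknowledged offset.
    // A chunk that straddles the offset is kept whole.
    public void AcknowledgeThrough(long offset)
    {
        while (chunks.Count > 0 && StartOffset + chunks.Peek().Length <= offset)
        {
            StartOffset += chunks.Dequeue().Length;
        }
    }

    // Chunks to resend after a reconnect, oldest first.
    public byte[][] Snapshot()
    {
        return chunks.ToArray();
    }
}
```

With this shape, the source stream never needs to support seeking: after a reconnect, the chunks returned by Snapshot are replayed first, and reading from the stream then continues where it left off.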
By default, the SDK buffers audio so that it can recover when a network interruption occurs. In a scenario where it's preferable to discard the audio lost during the network disconnect and restart the connection, disable audio buffering by setting EnableAudioBuffering in the Preferences object to false.