Microsoft Speech API overview

The cloud-based Microsoft Speech API gives developers an easy way to add speech-enabled features to their applications, such as voice command control, natural conversational dialog with users, and speech transcription and dictation. The Microsoft Speech API supports both Speech to Text and Text to Speech conversion.

  • The Speech to Text API converts human speech to text that can be used as input or as commands to control your application.
  • The Text to Speech API converts text to audio streams that can be played back to the user of your application.

Speech to text (speech recognition)

The Microsoft speech recognition API transcribes audio streams into text that your application can display to the user or act upon as command input. It provides two ways for developers to add speech to their apps.

  • REST APIs: Developers can use HTTP calls from their apps to the service for speech recognition.
  • Client libraries: For advanced features, developers can download the Microsoft Speech client libraries and link them into their apps. The client libraries are available on several platforms (Windows, Android, iOS) and in several languages (C#, Java, JavaScript, Objective-C).

Use cases                                                                        REST APIs  Client libraries
-------------------------------------------------------------------------------  ---------  ----------------
Convert short spoken audio, such as commands (< 15 s), without interim results  Yes        Yes
Convert long audio (> 15 s)                                                      No         Yes
Stream audio with interim results                                                No         Yes
Understand the text converted from audio using LUIS                              No         Yes
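
The use cases listed above amount to a simple decision rule, which can be sketched in Python. The function name, its parameters, and the treatment of audio that is exactly 15 s long (sent to the client libraries here) are assumptions made for illustration; they are not part of the Speech API.

```python
# Hypothetical helper: not part of any Microsoft library.
REST_MAX_SECONDS = 15  # limit taken from the use cases above


def choose_access_method(audio_seconds, needs_interim_results=False,
                         needs_luis=False):
    """Return the simplest option that covers the use case.

    "rest" means plain HTTP calls are enough; "client_library" means
    the use case is only listed as supported by the client libraries.
    """
    if needs_interim_results or needs_luis:
        return "client_library"
    # Assumption: audio of exactly 15 s is treated as long audio.
    if audio_seconds >= REST_MAX_SECONDS:
        return "client_library"
    return "rest"


print(choose_access_method(4.0))                    # rest
print(choose_access_method(120.0))                  # client_library
print(choose_access_method(4.0, needs_luis=True))   # client_library
```

Short audio is supported by both options, so the sketch returns "rest" in that case only because it needs no extra download.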

Whichever approach developers choose (REST APIs or client libraries), the Microsoft speech service supports the following:

  • Advanced speech recognition technologies from Microsoft that are used by Cortana, Office Dictation, Office Translator, and other Microsoft products.
  • Real-time continuous recognition. The speech recognition API lets users transcribe audio into text in real time and can return intermediate results for the words recognized so far. The speech service also supports end-of-speech detection. In addition, users can choose formatting options such as capitalization and punctuation, profanity masking, and text normalization.
  • Optimized speech recognition results for interactive, conversation, and dictation scenarios. For scenarios that require customized language models and acoustic models, Custom Speech Service lets you create speech models that are tailored to your application and your users.
  • Support for many spoken languages in multiple dialects. For the full list of supported languages in each recognition mode, see recognition languages.
  • Integration with language understanding. Besides converting the input audio into text, the Speech to Text API lets applications understand what the text means. It uses the Language Understanding Intelligent Service (LUIS) to extract intents and entities from the recognized text.
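
As a rough illustration of how the recognition mode, language, and profanity options might be passed in a REST call, the Python sketch below builds an HTTP request without sending it. The endpoint URL is a placeholder, and the query parameter names, the subscription-key header name, and the function itself are assumptions made for this example; check the API reference for the actual values.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder endpoint; take the real one from the API reference.
BASE_URL = "https://speech.example.invalid/recognition"

# Recognition scenarios named in the list above.
RECOGNITION_MODES = ("interactive", "conversation", "dictation")


def build_recognition_request(audio, subscription_key, mode="interactive",
                              language="en-US", mask_profanity=True,
                              base_url=BASE_URL):
    """Build (but do not send) an HTTP POST request for one short utterance."""
    if mode not in RECOGNITION_MODES:
        raise ValueError(f"unknown recognition mode: {mode!r}")
    # Query parameter names are assumptions made for this sketch.
    query = urlencode({
        "language": language,
        "profanity": "masked" if mask_profanity else "raw",
    })
    headers = {
        # Header name assumed; verify it against the API reference.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav",
    }
    return Request(f"{base_url}/{mode}?{query}", data=audio,
                   headers=headers, method="POST")


request = build_recognition_request(b"RIFF...", "YOUR_KEY", mode="dictation")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the sketch runnable offline; passing the result to `urllib.request.urlopen` would perform the actual call once a real endpoint and key are supplied.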

Text to speech (speech synthesis)

Text to Speech APIs use REST to convert structured text to an audio stream. The APIs provide fast text-to-speech conversion in various voices and languages. Users can also change audio characteristics such as pronunciation, volume, and pitch by using SSML tags.
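
To make the role of SSML more concrete, the following Python sketch assembles a small SSML document that adjusts pitch and volume through the standard `prosody` element. The function name and the voice name `ExampleVoice` are hypothetical; the voices actually available, and any extra attributes the service requires, should be taken from the Text to Speech API reference.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_ssml(text, voice_name, language="en-US", pitch=None, volume=None):
    """Return an SSML string that wraps text in voice and prosody elements."""
    speak = ET.Element("speak", {"version": "1.0", XML_LANG: language})
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    prosody_attrs = {}
    if pitch is not None:
        prosody_attrs["pitch"] = pitch
    if volume is not None:
        prosody_attrs["volume"] = volume
    # Add a prosody element only when a characteristic is changed.
    if prosody_attrs:
        target = ET.SubElement(voice, "prosody", prosody_attrs)
    else:
        target = voice
    target.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


# "ExampleVoice" is a made-up voice name used only for illustration.
print(build_ssml("Turn left & continue.", "ExampleVoice",
                 pitch="+10%", volume="loud"))
```

The resulting string would be sent as the body of the REST request. Pronunciation changes, which SSML expresses with other elements such as `phoneme`, are not covered by this sketch.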
