Get started with text-to-speech

In this quickstart, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Skip to samples on GitHub

If you want to skip straight to sample code, see the C# quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following using statements at the top of your code file.

using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

public class Program 
{
    static async Task Main()
    {
        await SynthesizeAudioAsync();
    }

    static async Task SynthesizeAudioAsync() 
    {
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    }
}
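
The rest of this article uses the subscription key and region approach. If you need one of the other initialization methods listed above, the pattern is similar. The following is a minimal sketch, not used elsewhere in this article; the endpoint URL, host URL, and authorization token are placeholders you replace with values for your own resource.

// Sketch only: placeholder values that you replace with your own endpoint, host, or token.
var endpointConfig = SpeechConfig.FromEndpoint(new Uri("YourEndpointUrl"), "YourSubscriptionKey");
var hostConfig = SpeechConfig.FromHost(new Uri("YourHostUrl"));
var tokenConfig = SpeechConfig.FromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");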

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file using the FromWavFileOutput() function, and instantiate it with a using statement. A using statement in this context ensures that unmanaged resources are disposed automatically when the object goes out of scope.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer with another using statement. Pass your config object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakTextAsync() with a string of text.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav");
    using var synthesizer = new SpeechSynthesizer(config, audioConfig);
    await synthesizer.SpeakTextAsync("A simple test to write to a file.");
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, omit the AudioConfig param when creating the SpeechSynthesizer in the example above. This outputs to the current active output device.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config);
    await synthesizer.SpeakTextAsync("Synthesizing directly to speaker output.");
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seekable stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass null for the AudioConfig in the SpeechSynthesizer constructor.

Note

If you pass null for the AudioConfig, rather than omitting it as in the speaker output example above, the audio is not played by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The AudioData property contains a byte[] of the output data. You can work with this byte[] manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream.FromResult() static function to get a stream from the result.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config, null);
    
    var result = await synthesizer.SpeakTextAsync("Getting the response as an in-memory stream.");
    using var stream = AudioDataStream.FromResult(result);
}

From here you can implement any custom behavior using the resulting stream object.
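For example, here's a minimal sketch, placed after the AudioDataStream.FromResult() call above, that reads the stream in chunks, say to forward the audio to another service. The buffer size is an arbitrary choice for illustration.

// Sketch: read the audio out of the stream in chunks, for example to forward it to another service.
byte[] buffer = new byte[16000];
uint totalBytes = 0;
uint bytesRead;
while ((bytesRead = stream.ReadData(buffer)) > 0)
{
    // The first bytesRead bytes of buffer now contain audio data to process.
    totalBytes += bytesRead;
}
Console.WriteLine($"Read {totalBytes} bytes of audio data.");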

Customize audio format

The following section shows how to customize audio output attributes including:

  • Audio file type
  • Sample-rate
  • Bit-depth

To change the audio format, you use the SetSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

    using var synthesizer = new SpeechSynthesizer(config, null);
    var result = await synthesizer.SpeakTextAsync("Customizing audio output format.");

    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakTextAsync() function, you use SpeakSsmlAsync(). This function expects an XML string, so you first load your SSML config as a string using File.ReadAllText(). From here, the result object is exactly the same as previous examples.

Note

If you're using Visual Studio, your build config likely will not find your XML file by default. To fix this, right-click the XML file and select Properties. Change Build Action to Content, and change Copy to Output Directory to Copy always.
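
If you'd rather set this directly in the project file, the equivalent .csproj entry looks roughly like the following sketch, assuming the file is named ssml.xml.

<ItemGroup>
  <Content Include="ssml.xml">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </Content>
</ItemGroup>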

public static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config, null);
    
    var ssml = File.ReadAllText("./ssml.xml");
    var result = await synthesizer.SpeakSsmlAsync(ssml);

    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

Important

Neural voices are only supported for Speech resources created in the East US, Southeast Asia, and West Europe regions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

In this quickstart, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Skip to samples on GitHub

If you want to skip straight to sample code, see the C++ quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following #include and using statements at the top of your code file.

#include <iostream>
#include <fstream>
#include <string>
#include <speechapi_cxx.h>

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

void synthesizeSpeech();

int wmain()
{
    try
    {
        synthesizeSpeech();
    }
    catch (const exception& e)
    {
        cout << e.what() << endl;
    }
    return 0;
}

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
}
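
The rest of this article uses the subscription key and region approach. If you need one of the other initialization methods listed above, the pattern is similar. The following is a minimal sketch, not used elsewhere in this article; the endpoint URL, host URL, and authorization token are placeholders you replace with values for your own resource.

// Sketch only: placeholder values that you replace with your own endpoint, host, or token.
auto endpointConfig = SpeechConfig::FromEndpoint("YourEndpointUrl", "YourSubscriptionKey");
auto hostConfig = SpeechConfig::FromHost("YourHostUrl");
auto tokenConfig = SpeechConfig::FromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");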

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file, using the FromWavFileOutput() function.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer, passing your config object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakTextAsync() with a string of text.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, audioConfig);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, omit the AudioConfig param when creating the SpeechSynthesizer in the example above. This outputs to the current active output device.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);
    auto result = synthesizer->SpeakTextAsync("Synthesizing directly to speaker output.").get();
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seekable stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig, as you will manage the output behavior manually from this point onward for increased control. Then pass NULL for the AudioConfig when you create the SpeechSynthesizer with FromConfig().

Note

If you pass NULL for the AudioConfig, rather than omitting it as in the speaker output example above, the audio is not played by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The GetAudioData getter returns the output data as a byte vector. You can work with this data manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream::FromResult() static function to get a stream from the result.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    
    auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get();
    auto stream = AudioDataStream::FromResult(result);
}

From here you can implement any custom behavior using the resulting stream object.
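For example, here's a minimal sketch, placed after the AudioDataStream::FromResult() call above, that reads the stream in chunks, say to forward the audio to another service. The buffer size is an arbitrary choice for illustration.

// Sketch: read the audio out of the stream in chunks, for example to forward it to another service.
uint8_t buffer[16000];
uint32_t totalBytes = 0;
uint32_t bytesRead = 0;
while ((bytesRead = stream->ReadData(buffer, sizeof(buffer))) > 0)
{
    // The first bytesRead bytes of buffer now contain audio data to process.
    totalBytes += bytesRead;
}
cout << "Read " << totalBytes << " bytes of audio data." << endl;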

Customize audio format

The following section shows how to customize audio output attributes including:

  • Audio file type
  • Sample-rate
  • Bit-depth

To change the audio format, you use the SetSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
    
    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakTextAsync() function, you use SpeakSsmlAsync(). This function expects an XML string, so you first load your SSML config as a string. From here, the result object is exactly the same as previous examples.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    
    std::ifstream file("./ssml.xml");
    std::string ssml, line;
    while (std::getline(file, line))
    {
        ssml += line;
        ssml.push_back('\n');
    }
    auto result = synthesizer->SpeakSsmlAsync(ssml).get();
    
    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

Important

Neural voices are only supported for Speech resources created in the East US, Southeast Asia, and West Europe regions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

In this quickstart, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Skip to samples on GitHub

If you want to skip straight to sample code, see the Java quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following import statements at the top of your code file.

import com.microsoft.cognitiveservices.speech.AudioDataStream;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisOutputFormat;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisResult;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

import java.io.*;
import java.util.Scanner;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

public class Program 
{
    public static void main(String[] args) {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    }
}
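
The rest of this article uses the subscription key and region approach. If you need one of the other initialization methods listed above, the pattern is similar. The following is a minimal sketch, not used elsewhere in this article; the endpoint URI, host URI, and authorization token are placeholders you replace with values for your own resource.

// Sketch only: placeholder values that you replace with your own endpoint, host, or token.
SpeechConfig endpointConfig = SpeechConfig.fromEndpoint(java.net.URI.create("YourEndpointUrl"), "YourSubscriptionKey");
SpeechConfig hostConfig = SpeechConfig.fromHost(java.net.URI.create("YourHostUrl"));
SpeechConfig tokenConfig = SpeechConfig.fromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");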

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file using the fromWavFileOutput() static function.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer passing your speechConfig object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakText() with a string of text.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromWavFileOutput("path/to/write/file.wav");

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.SpeakText("A simple test to write to a file.");
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the AudioConfig using the fromDefaultSpeakerOutput() static function. This outputs to the current active output device.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromDefaultSpeakerOutput();

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.SpeakText("Synthesizing directly to speaker output.");
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seekable stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass null for the AudioConfig in the SpeechSynthesizer constructor.

Note

If you pass null for the AudioConfig, rather than using the default speaker output as in the example above, the audio is not played by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The SpeechSynthesisResult.getAudioData() function returns a byte[] of the output data. You can work with this byte[] manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream.fromResult() static function to get a stream from the result.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);
    
    SpeechSynthesisResult result = synthesizer.SpeakText("Getting the response as an in-memory stream.");
    AudioDataStream stream = AudioDataStream.fromResult(result);
    System.out.print(stream.getStatus());
}

From here you can implement any custom behavior using the resulting stream object.
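For example, here's a minimal sketch, placed after the AudioDataStream.fromResult() call above, that reads the stream in chunks, say to forward the audio to another service. The buffer size is an arbitrary choice for illustration.

// Sketch: read the audio out of the stream in chunks, for example to forward it to another service.
byte[] buffer = new byte[16000];
long totalBytes = 0;
long bytesRead;
while ((bytesRead = stream.readData(buffer)) > 0) {
    // The first bytesRead bytes of buffer now contain audio data to process.
    totalBytes += bytesRead;
}
System.out.println("Read " + totalBytes + " bytes of audio data.");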

Customize audio format

The following section shows how to customize audio output attributes including:

  • Audio file type
  • Sample-rate
  • Bit-depth

To change the audio format, you use the setSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // set the output format
    speechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);
    SpeechSynthesisResult result = synthesizer.SpeakText("Customizing audio output format.");
    AudioDataStream stream = AudioDataStream.fromResult(result);
    stream.saveToWavFile("path/to/write/file.wav");
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakText() function, you use SpeakSsml(). This function expects an XML string, so first you create a function to load an XML file and return it as a string.

private static String xmlToString(String filePath) {
    File file = new File(filePath);
    StringBuilder fileContents = new StringBuilder((int)file.length());

    try (Scanner scanner = new Scanner(file)) {
        while(scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + System.lineSeparator());
        }
        return fileContents.toString().trim();
    } catch (FileNotFoundException ex) {
        return "File not found.";
    }
}

From here, the result object is exactly the same as previous examples.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);

    String ssml = xmlToString("ssml.xml");
    SpeechSynthesisResult result = synthesizer.SpeakSsml(ssml);
    AudioDataStream stream = AudioDataStream.fromResult(result);
    stream.saveToWavFile("path/to/write/file.wav");
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

Important

Neural voices are only supported for Speech resources created in the East US, Southeast Asia, and West Europe regions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

In this quickstart, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Skip to samples on GitHub

If you want to skip straight to sample code, see the JavaScript quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK for JavaScript. Depending on your platform, use the following instructions:

Additionally, depending on the target environment, use one of the following:

Download and extract the Speech SDK for JavaScript microsoft.cognitiveservices.speech.sdk.bundle.js file, and place it in a folder accessible to your HTML file.

<script src="microsoft.cognitiveservices.speech.sdk.bundle.js"></script>

Tip

If you're targeting a web browser and using the <script> tag, the sdk prefix is not needed. The sdk prefix is an alias used to name the require module.
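
For example, in a Node.js project, the sdk alias used in some of the samples below typically comes from the require or import statement; the package name shown here is the Speech SDK's npm package.

// Node.js: create the "sdk" alias used in the samples below.
const sdk = require("microsoft-cognitiveservices-speech-sdk");

// Or, with ES modules / TypeScript:
// import * as sdk from "microsoft-cognitiveservices-speech-sdk";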

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
}

synthesizeSpeech();
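
The rest of this article uses the subscription key and region approach. If you need one of the other initialization methods listed above, the pattern is similar. The following is a minimal sketch, not used elsewhere in this article; the endpoint URL, host URL, and authorization token are placeholders you replace with values for your own resource.

// Sketch only: placeholder values that you replace with your own endpoint, host, or token.
const endpointConfig = SpeechConfig.fromEndpoint(new URL("YourEndpointUrl"), "YourSubscriptionKey");
const hostConfig = SpeechConfig.fromHost(new URL("YourHostUrl"));
const tokenConfig = SpeechConfig.fromAuthorizationToken("YourAuthorizationToken", "YourServiceRegion");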

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file using the fromAudioFileOutput() static function.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromAudioFileOutput("path/to/file.wav");
}

Next, instantiate a SpeechSynthesizer, passing your speechConfig object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running speakTextAsync() with a string of text. The result callback is a great place to call synthesizer.close(); in fact, this call is needed in order for synthesis to function correctly.

function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromAudioFileOutput("path-to-file.wav");

    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync(
        "A simple test to write to a file.",
        result => {
            if (result) {
                console.log(JSON.stringify(result));
            }
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the AudioConfig using the fromDefaultSpeakerOutput() static function. This outputs to the current active output device.

function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromDefaultSpeakerOutput();

    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync(
        "Synthesizing directly to speaker output.",
        result => {
            if (result) {
                console.log(JSON.stringify(result));
            }
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seekable stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass undefined for the AudioConfig in the SpeechSynthesizer constructor.

Note

If you pass undefined for the AudioConfig, rather than using the default speaker output as in the example above, the audio is not played by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The SpeechSynthesisResult.audioData property returns an ArrayBuffer of the output data. You can work with this ArrayBuffer manually.

function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, undefined);

    synthesizer.speakTextAsync(
        "Getting the response as an in-memory stream.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
            console.log(`Audio data byte size: ${audioData.byteLength}.`)

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

From here you can implement any custom behavior using the resulting ArrayBuffer object.
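For example, in Node.js, here's a minimal sketch that wraps the resulting ArrayBuffer in a Buffer and writes it to disk; the fs module and the output path are assumptions for this example.

// Sketch: persist the ArrayBuffer returned in result.audioData (Node.js only).
const fs = require("fs");

function saveAudioData(audioData) {
    fs.writeFileSync("path/to/write/file.wav", Buffer.from(audioData));
    console.log(`Wrote ${audioData.byteLength} bytes.`);
}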

Customize audio format

The following section shows how to customize audio output attributes including:

  • Audio file type
  • Sample-rate
  • Bit-depth

To change the audio format, you use the speechSynthesisOutputFormat property on the SpeechConfig object. This property expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the speechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, get the audio ArrayBuffer data and interact with it.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // Set the output format
    speechConfig.speechSynthesisOutputFormat = SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm;

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, undefined);
    synthesizer.speakTextAsync(
        "Customizing audio output format.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
            console.log(`Audio data byte size: ${audioData.byteLength}.`)

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Running your program again returns the audio data in the customized format, which you can then process or save as needed.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the speakTextAsync() function, you use speakSsmlAsync(). This function expects an XML string, so first you create a function to load an XML file and return it as a string.

const { readFileSync } = require("fs");

function xmlToString(filePath) {
    const xml = readFileSync(filePath, "utf8");
    return xml;
}

For more information on readFileSync, see Node.js file system. From here, the result object is exactly the same as previous examples.

function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, undefined);

    const ssml = xmlToString("ssml.xml");
    synthesizer.speakSsmlAsync(
        ssml,
        result => {
            if (result.errorDetails) {
                console.error(result.errorDetails);
            } else {
                console.log(JSON.stringify(result));
            }

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

Important

Neural voices are only supported for Speech resources created in the East US, Southeast Asia, and West Europe regions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

You can synthesize speech from text using the Speech SDK for Swift and Objective-C.

Prerequisites

The following samples assume that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install Speech SDK and samples

The Cognitive Services Speech SDK contains samples written in Swift and Objective-C for iOS and Mac. Click a link to see installation instructions for each sample:

We also provide an online Speech SDK for Objective-C Reference.

In this quickstart, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Skip to samples on GitHub

If you want to skip straight to sample code, see the Python quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK.

pip install azure-cognitiveservices-speech

If you're on macOS and run into install issues, you may need to run this command first.

python3 -m pip install --upgrade pip

After the Speech SDK is installed, include the following import statements at the top of your script.

from azure.cognitiveservices.speech import AudioDataStream, SpeechConfig, SpeechSynthesizer, SpeechSynthesisOutputFormat
from azure.cognitiveservices.speech.audio import AudioOutputConfig

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier.

speech_config = SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
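
The rest of this article uses the subscription key and region approach. If you need one of the other initialization methods listed above, the pattern is similar. The following is a minimal sketch, not used elsewhere in this article; the endpoint URL, host URL, and authorization token are placeholders you replace with values for your own resource.

# Sketch only: placeholder values that you replace with your own endpoint, host, or token.
endpoint_config = SpeechConfig(endpoint="YourEndpointUrl", subscription="YourSubscriptionKey")
host_config = SpeechConfig(host="YourHostUrl")
token_config = SpeechConfig(auth_token="YourAuthorizationToken", region="YourServiceRegion")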

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioOutputConfig object that specifies how output results should be handled.

To start, create an AudioOutputConfig to automatically write the output to a .wav file, using the filename constructor param.

audio_config = AudioOutputConfig(filename="path/to/write/file.wav")

Next, instantiate a SpeechSynthesizer by passing your speech_config object and the audio_config object as params. Then, executing speech synthesis and writing to a file is as simple as running speak_text_async() with a string of text.

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
synthesizer.speak_text_async("A simple test to write to a file.")

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, use the example in the previous section, but change the AudioOutputConfig by removing the filename param and setting use_default_speaker=True. This outputs to the current active output device.

audio_config = AudioOutputConfig(use_default_speaker=True)

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than writing it directly to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seekable stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioOutputConfig, as you will manage the output behavior manually from this point onward for increased control. Then pass audio_config=None to the SpeechSynthesizer constructor.

Note

If you pass None for the audio_config, rather than using the default speaker output as in the example above, the audio is not played by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The audio_data property contains a bytes object of the output data. You can work with this object manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example you use the AudioDataStream constructor to get a stream from the result.

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("Getting the response as an in-memory stream.").get()
stream = AudioDataStream(result)

From here you can implement any custom behavior using the resulting stream object.
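For example, here's a minimal sketch, placed after the AudioDataStream constructor call above, that pulls the audio out of the stream in chunks, say to forward it to another service. The buffer size is an arbitrary choice for illustration.

# Sketch: read the audio out of the stream in chunks, for example to forward it to another service.
audio_buffer = bytes(16000)
total_size = 0
filled_size = stream.read_data(audio_buffer)
while filled_size > 0:
    # The first filled_size bytes of audio_buffer now contain audio data to process.
    total_size += filled_size
    filled_size = stream.read_data(audio_buffer)
print("Read {} bytes of audio data.".format(total_size))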

Customize audio format

The following section shows how to customize audio output attributes including:

  • Audio file type
  • Sample-rate
  • Bit-depth

To change the audio format, you use the set_speech_synthesis_output_format() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

speech_config.set_speech_synthesis_output_format(SpeechSynthesisOutputFormat["Riff24Khz16BitMonoPcm"])
synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

result = synthesizer.speak_text_async("Customizing audio output format.").get()
stream = AudioDataStream(result)
stream.save_to_wav_file("path/to/write/file.wav")

Running your program again will write a customized .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the speak_text_async() function, you use speak_ssml_async(). This function expects an XML string, so you first read your SSML config as a string. From here, the result object is exactly the same as previous examples.

Note

If your ssml_string contains a byte order mark (BOM) at the beginning of the string, you need to strip it off or the service will return an error. You do this by setting the encoding parameter as follows: open("ssml.xml", "r", encoding="utf-8-sig").

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

ssml_string = open("ssml.xml", "r").read()
result = synthesizer.speak_ssml_async(ssml_string).get()

stream = AudioDataStream(result)
stream.save_to_wav_file("path/to/write/file.wav")

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and sounds unnatural. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When you use a neural voice, the synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

Important

Neural voices are only supported for Speech resources created in the East US, Southeast Asia, and West Europe regions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
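
If you only need to switch to a neural voice and don't need the express-as styling, you can also select the voice directly on the configuration object instead of in SSML. The following is a sketch using the speech_synthesis_voice_name property on SpeechConfig, reusing the setup from the earlier examples.

speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"
synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

result = synthesizer.speak_text_async("This is a neural voice selected on the config object.").get()
stream = AudioDataStream(result)
stream.save_to_wav_file("path/to/write/file.wav")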

In this quickstart, you learn how to convert text to speech using the Speech service and cURL.

For a high-level look at Text-To-Speech concepts, see the overview article.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Convert text to speech

At a command prompt, run the following command. You will need to insert the following values into the command.

  • Your Speech service subscription key.
  • Your Speech service region.

You might also wish to change the following values.

  • The X-Microsoft-OutputFormat header value, which controls the audio output format. You can find a list of supported audio output formats in the text-to-speech REST API reference.
  • The output voice. To get a list of voices available for your Speech endpoint, see the next section.
  • The output file. In this example, we direct the response from the server into a file named output.mp3, which matches the MP3 format requested in the X-Microsoft-OutputFormat header.

curl --location --request POST 'https://INSERT_REGION_HERE.tts.speech.microsoft.com/cognitiveservices/v1' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '<speak version='\''1.0'\'' xml:lang='\''en-US'\''>
    <voice xml:lang='\''en-US'\'' xml:gender='\''Female'\'' name='\''en-US-AriaRUS'\''>
        my voice is my passport verify me
    </voice>
</speak>' > output.mp3

List available voices for your Speech endpoint

To list the available voices for your Speech endpoint, run the following command.

curl --location --request GET 'https://INSERT_ENDPOINT_HERE.tts.speech.microsoft.com/cognitiveservices/voices/list' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE'

You should receive a response like the following one.

[
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ar-EG, Hoda)",
        "DisplayName": "Hoda",
        "LocalName": "هدى",
        "ShortName": "ar-EG-Hoda",
        "Gender": "Female",
        "Locale": "ar-EG",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ar-SA, Naayf)",
        "DisplayName": "Naayf",
        "LocalName": "نايف",
        "ShortName": "ar-SA-Naayf",
        "Gender": "Male",
        "Locale": "ar-SA",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (bg-BG, Ivan)",
        "DisplayName": "Ivan",
        "LocalName": "Иван",
        "ShortName": "bg-BG-Ivan",
        "Gender": "Male",
        "Locale": "bg-BG",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ca-ES, HerenaRUS)",
        "DisplayName": "Herena",
        "LocalName": "Helena",
        "ShortName": "ca-ES-HerenaRUS",
        "Gender": "Female",
        "Locale": "ca-ES",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (cs-CZ, Jakub)",
        "DisplayName": "Jakub",
        "LocalName": "Jakub",
        "ShortName": "cs-CZ-Jakub",
        "Gender": "Male",
        "Locale": "cs-CZ",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (da-DK, HelleRUS)",
        "DisplayName": "Helle",
        "LocalName": "Helle",
        "ShortName": "da-DK-HelleRUS",
        "Gender": "Female",
        "Locale": "da-DK",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (de-AT, Michael)",
        "DisplayName": "Michael",
        "LocalName": "Michael",
        "ShortName": "de-AT-Michael",
        "Gender": "Male",
        "Locale": "de-AT",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (de-CH, Karsten)",
        "DisplayName": "Karsten",
        "LocalName": "Karsten",
        "ShortName": "de-CH-Karsten",
        "Gender": "Male",
        "Locale": "de-CH",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (de-DE, HeddaRUS)",
        "DisplayName": "Hedda",
        "LocalName": "Hedda",
        "ShortName": "de-DE-HeddaRUS",
        "Gender": "Female",
        "Locale": "de-DE",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (de-DE, Stefan)",
        "DisplayName": "Stefan",
        "LocalName": "Stefan",
        "ShortName": "de-DE-Stefan",
        "Gender": "Male",
        "Locale": "de-DE",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (el-GR, Stefanos)",
        "DisplayName": "Stefanos",
        "LocalName": "Στέφανος",
        "ShortName": "el-GR-Stefanos",
        "Gender": "Male",
        "Locale": "el-GR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)",
        "DisplayName": "Catherine",
        "LocalName": "Catherine",
        "ShortName": "en-AU-Catherine",
        "Gender": "Female",
        "Locale": "en-AU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-AU, HayleyRUS)",
        "DisplayName": "Hayley",
        "LocalName": "Hayley",
        "ShortName": "en-AU-HayleyRUS",
        "Gender": "Female",
        "Locale": "en-AU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-CA, HeatherRUS)",
        "DisplayName": "Heather",
        "LocalName": "Heather",
        "ShortName": "en-CA-HeatherRUS",
        "Gender": "Female",
        "Locale": "en-CA",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-CA, Linda)",
        "DisplayName": "Linda",
        "LocalName": "Linda",
        "ShortName": "en-CA-Linda",
        "Gender": "Female",
        "Locale": "en-CA",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-GB, George)",
        "DisplayName": "George",
        "LocalName": "George",
        "ShortName": "en-GB-George",
        "Gender": "Male",
        "Locale": "en-GB",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-GB, HazelRUS)",
        "DisplayName": "Hazel",
        "LocalName": "Hazel",
        "ShortName": "en-GB-HazelRUS",
        "Gender": "Female",
        "Locale": "en-GB",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-GB, Susan)",
        "DisplayName": "Susan",
        "LocalName": "Susan",
        "ShortName": "en-GB-Susan",
        "Gender": "Female",
        "Locale": "en-GB",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-IE, Sean)",
        "DisplayName": "Sean",
        "LocalName": "Sean",
        "ShortName": "en-IE-Sean",
        "Gender": "Male",
        "Locale": "en-IE",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-IN, Heera)",
        "DisplayName": "Heera",
        "LocalName": "Heera",
        "ShortName": "en-IN-Heera",
        "Gender": "Female",
        "Locale": "en-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-IN, PriyaRUS)",
        "DisplayName": "Priya",
        "LocalName": "Priya",
        "ShortName": "en-IN-PriyaRUS",
        "Gender": "Female",
        "Locale": "en-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-IN, Ravi)",
        "DisplayName": "Ravi",
        "LocalName": "Ravi",
        "ShortName": "en-IN-Ravi",
        "Gender": "Male",
        "Locale": "en-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, AriaRUS)",
        "DisplayName": "Aria",
        "LocalName": "Aria",
        "ShortName": "en-US-AriaRUS",
        "Gender": "Female",
        "Locale": "en-US",
        "SampleRateHertz": "24000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, BenjaminRUS)",
        "DisplayName": "Benjamin",
        "LocalName": "Benjamin",
        "ShortName": "en-US-BenjaminRUS",
        "Gender": "Male",
        "Locale": "en-US",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, GuyRUS)",
        "DisplayName": "Guy",
        "LocalName": "Guy",
        "ShortName": "en-US-GuyRUS",
        "Gender": "Male",
        "Locale": "en-US",
        "SampleRateHertz": "24000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)",
        "DisplayName": "Zira",
        "LocalName": "Zira",
        "ShortName": "en-US-ZiraRUS",
        "Gender": "Female",
        "Locale": "en-US",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (es-ES, HelenaRUS)",
        "DisplayName": "Helena",
        "LocalName": "Helena",
        "ShortName": "es-ES-HelenaRUS",
        "Gender": "Female",
        "Locale": "es-ES",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (es-ES, Laura)",
        "DisplayName": "Laura",
        "LocalName": "Laura",
        "ShortName": "es-ES-Laura",
        "Gender": "Female",
        "Locale": "es-ES",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (es-ES, Pablo)",
        "DisplayName": "Pablo",
        "LocalName": "Pablo",
        "ShortName": "es-ES-Pablo",
        "Gender": "Male",
        "Locale": "es-ES",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (es-MX, HildaRUS)",
        "DisplayName": "Hilda",
        "LocalName": "Hilda",
        "ShortName": "es-MX-HildaRUS",
        "Gender": "Female",
        "Locale": "es-MX",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (es-MX, Raul)",
        "DisplayName": "Raul",
        "LocalName": "Raúl",
        "ShortName": "es-MX-Raul",
        "Gender": "Male",
        "Locale": "es-MX",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fi-FI, HeidiRUS)",
        "DisplayName": "Heidi",
        "LocalName": "Heidi",
        "ShortName": "fi-FI-HeidiRUS",
        "Gender": "Female",
        "Locale": "fi-FI",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-CA, Caroline)",
        "DisplayName": "Caroline",
        "LocalName": "Caroline",
        "ShortName": "fr-CA-Caroline",
        "Gender": "Female",
        "Locale": "fr-CA",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-CA, HarmonieRUS)",
        "DisplayName": "Harmonie",
        "LocalName": "Harmonie",
        "ShortName": "fr-CA-HarmonieRUS",
        "Gender": "Female",
        "Locale": "fr-CA",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-CH, Guillaume)",
        "DisplayName": "Guillaume",
        "LocalName": "Guillaume",
        "ShortName": "fr-CH-Guillaume",
        "Gender": "Male",
        "Locale": "fr-CH",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-FR, HortenseRUS)",
        "DisplayName": "Hortense",
        "LocalName": "Hortense",
        "ShortName": "fr-FR-HortenseRUS",
        "Gender": "Female",
        "Locale": "fr-FR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-FR, Julie)",
        "DisplayName": "Julie",
        "LocalName": "Julie",
        "ShortName": "fr-FR-Julie",
        "Gender": "Female",
        "Locale": "fr-FR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (fr-FR, Paul)",
        "DisplayName": "Paul",
        "LocalName": "Paul",
        "ShortName": "fr-FR-Paul",
        "Gender": "Male",
        "Locale": "fr-FR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (he-IL, Asaf)",
        "DisplayName": "Asaf",
        "LocalName": "אסף",
        "ShortName": "he-IL-Asaf",
        "Gender": "Male",
        "Locale": "he-IL",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (hi-IN, Hemant)",
        "DisplayName": "Hemant",
        "LocalName": "हेमन्त",
        "ShortName": "hi-IN-Hemant",
        "Gender": "Male",
        "Locale": "hi-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (hi-IN, Kalpana)",
        "DisplayName": "Kalpana",
        "LocalName": "कल्पना",
        "ShortName": "hi-IN-Kalpana",
        "Gender": "Female",
        "Locale": "hi-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (hr-HR, Matej)",
        "DisplayName": "Matej",
        "LocalName": "Matej",
        "ShortName": "hr-HR-Matej",
        "Gender": "Male",
        "Locale": "hr-HR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (hu-HU, Szabolcs)",
        "DisplayName": "Szabolcs",
        "LocalName": "Szabolcs",
        "ShortName": "hu-HU-Szabolcs",
        "Gender": "Male",
        "Locale": "hu-HU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (id-ID, Andika)",
        "DisplayName": "Andika",
        "LocalName": "Andika",
        "ShortName": "id-ID-Andika",
        "Gender": "Male",
        "Locale": "id-ID",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (it-IT, Cosimo)",
        "DisplayName": "Cosimo",
        "LocalName": "Cosimo",
        "ShortName": "it-IT-Cosimo",
        "Gender": "Male",
        "Locale": "it-IT",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (it-IT, LuciaRUS)",
        "DisplayName": "Lucia",
        "LocalName": "Lucia",
        "ShortName": "it-IT-LuciaRUS",
        "Gender": "Female",
        "Locale": "it-IT",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ja-JP, Ayumi)",
        "DisplayName": "Ayumi",
        "LocalName": "歩美",
        "ShortName": "ja-JP-Ayumi",
        "Gender": "Female",
        "Locale": "ja-JP",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ja-JP, HarukaRUS)",
        "DisplayName": "Haruka",
        "LocalName": "春香",
        "ShortName": "ja-JP-HarukaRUS",
        "Gender": "Female",
        "Locale": "ja-JP",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ja-JP, Ichiro)",
        "DisplayName": "Ichiro",
        "LocalName": "一郎",
        "ShortName": "ja-JP-Ichiro",
        "Gender": "Male",
        "Locale": "ja-JP",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ko-KR, HeamiRUS)",
        "DisplayName": "Heami",
        "LocalName": "해 미",
        "ShortName": "ko-KR-HeamiRUS",
        "Gender": "Female",
        "Locale": "ko-KR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ms-MY, Rizwan)",
        "DisplayName": "Rizwan",
        "LocalName": "Rizwan",
        "ShortName": "ms-MY-Rizwan",
        "Gender": "Male",
        "Locale": "ms-MY",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (nb-NO, HuldaRUS)",
        "DisplayName": "Hulda",
        "LocalName": "Hulda",
        "ShortName": "nb-NO-HuldaRUS",
        "Gender": "Female",
        "Locale": "nb-NO",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (nl-NL, HannaRUS)",
        "DisplayName": "Hanna",
        "LocalName": "Hanna",
        "ShortName": "nl-NL-HannaRUS",
        "Gender": "Female",
        "Locale": "nl-NL",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (pl-PL, PaulinaRUS)",
        "DisplayName": "Paulina",
        "LocalName": "Paulina",
        "ShortName": "pl-PL-PaulinaRUS",
        "Gender": "Female",
        "Locale": "pl-PL",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (pt-BR, Daniel)",
        "DisplayName": "Daniel",
        "LocalName": "Daniel",
        "ShortName": "pt-BR-Daniel",
        "Gender": "Male",
        "Locale": "pt-BR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (pt-BR, HeloisaRUS)",
        "DisplayName": "Heloisa",
        "LocalName": "Heloisa",
        "ShortName": "pt-BR-HeloisaRUS",
        "Gender": "Female",
        "Locale": "pt-BR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (pt-PT, HeliaRUS)",
        "DisplayName": "Helia",
        "LocalName": "Hélia",
        "ShortName": "pt-PT-HeliaRUS",
        "Gender": "Female",
        "Locale": "pt-PT",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ro-RO, Andrei)",
        "DisplayName": "Andrei",
        "LocalName": "Andrei",
        "ShortName": "ro-RO-Andrei",
        "Gender": "Male",
        "Locale": "ro-RO",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ru-RU, EkaterinaRUS)",
        "DisplayName": "Ekaterina",
        "LocalName": "Екатерина",
        "ShortName": "ru-RU-EkaterinaRUS",
        "Gender": "Female",
        "Locale": "ru-RU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ru-RU, Irina)",
        "DisplayName": "Irina",
        "LocalName": "Ирина",
        "ShortName": "ru-RU-Irina",
        "Gender": "Female",
        "Locale": "ru-RU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ru-RU, Pavel)",
        "DisplayName": "Pavel",
        "LocalName": "Павел",
        "ShortName": "ru-RU-Pavel",
        "Gender": "Male",
        "Locale": "ru-RU",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (sk-SK, Filip)",
        "DisplayName": "Filip",
        "LocalName": "Filip",
        "ShortName": "sk-SK-Filip",
        "Gender": "Male",
        "Locale": "sk-SK",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (sl-SI, Lado)",
        "DisplayName": "Lado",
        "LocalName": "Lado",
        "ShortName": "sl-SI-Lado",
        "Gender": "Male",
        "Locale": "sl-SI",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (sv-SE, HedvigRUS)",
        "DisplayName": "Hedvig",
        "LocalName": "Hedvig",
        "ShortName": "sv-SE-HedvigRUS",
        "Gender": "Female",
        "Locale": "sv-SE",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ta-IN, Valluvar)",
        "DisplayName": "Valluvar",
        "LocalName": "வள்ளுவர்",
        "ShortName": "ta-IN-Valluvar",
        "Gender": "Male",
        "Locale": "ta-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (te-IN, Chitra)",
        "DisplayName": "Chitra",
        "LocalName": "చిత్ర",
        "ShortName": "te-IN-Chitra",
        "Gender": "Female",
        "Locale": "te-IN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (th-TH, Pattara)",
        "DisplayName": "Pattara",
        "LocalName": "ภัทรา",
        "ShortName": "th-TH-Pattara",
        "Gender": "Male",
        "Locale": "th-TH",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (tr-TR, SedaRUS)",
        "DisplayName": "Seda",
        "LocalName": "Seda",
        "ShortName": "tr-TR-SedaRUS",
        "Gender": "Female",
        "Locale": "tr-TR",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (vi-VN, An)",
        "DisplayName": "An",
        "LocalName": "An",
        "ShortName": "vi-VN-An",
        "Gender": "Male",
        "Locale": "vi-VN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-CN, HuihuiRUS)",
        "DisplayName": "Huihui",
        "LocalName": "慧慧",
        "ShortName": "zh-CN-HuihuiRUS",
        "Gender": "Female",
        "Locale": "zh-CN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-CN, Kangkang)",
        "DisplayName": "Kangkang",
        "LocalName": "康康",
        "ShortName": "zh-CN-Kangkang",
        "Gender": "Male",
        "Locale": "zh-CN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-CN, Yaoyao)",
        "DisplayName": "Yaoyao",
        "LocalName": "瑶瑶",
        "ShortName": "zh-CN-Yaoyao",
        "Gender": "Female",
        "Locale": "zh-CN",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-HK, Danny)",
        "DisplayName": "Danny",
        "LocalName": "Danny",
        "ShortName": "zh-HK-Danny",
        "Gender": "Male",
        "Locale": "zh-HK",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-HK, TracyRUS)",
        "DisplayName": "Tracy",
        "LocalName": "Tracy",
        "ShortName": "zh-HK-TracyRUS",
        "Gender": "Female",
        "Locale": "zh-HK",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-TW, HanHanRUS)",
        "DisplayName": "HanHan",
        "LocalName": "涵涵",
        "ShortName": "zh-TW-HanHanRUS",
        "Gender": "Female",
        "Locale": "zh-TW",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-TW, Yating)",
        "DisplayName": "Yating",
        "LocalName": "雅婷",
        "ShortName": "zh-TW-Yating",
        "Gender": "Female",
        "Locale": "zh-TW",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    },
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-TW, Zhiwei)",
        "DisplayName": "Zhiwei",
        "LocalName": "志威",
        "ShortName": "zh-TW-Zhiwei",
        "Gender": "Male",
        "Locale": "zh-TW",
        "SampleRateHertz": "16000",
        "VoiceType": "Standard"
    }
]

In this quickstart, you learn how to convert text to speech using the Speech service and the Speech CLI.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Download and install

Note

On Windows, you need the Microsoft Visual C++ Redistributable for Visual Studio 2019 for your platform. Installing this for the first time may require you to restart Windows.

Follow these steps to install the Speech CLI on Windows:

  1. Download the Speech CLI zip archive, then extract it.
  2. Go to the root directory spx-zips that you extracted from the download, and extract the subdirectory that you need (spx-net471 for .NET Framework 4.7, or spx-netcore-win-x64 for .NET Core 3.0 on an x64 CPU).

In the command prompt, change directory to this location, and then type spx to see help for the Speech CLI.

Note

On Windows, the Speech CLI can only show fonts available to the command prompt on the local computer. Windows Terminal supports all fonts produced interactively by the Speech CLI. If you output to a file, a text editor like Notepad or a web browser like Microsoft Edge can also show all fonts.

Note

PowerShell does not check the local directory when looking for a command. In PowerShell, change directory to the location of spx and call the tool by entering .\spx. If you add this directory to your path, PowerShell and the Windows command prompt will find spx from any directory without the .\ prefix.

Create subscription config

To start using the Speech CLI, you first need to enter your Speech subscription key and region information. See the region support page to find your region identifier. Once you have your subscription key and region identifier (ex. eastus, westus), run the following commands.

spx config @key --set SUBSCRIPTION-KEY
spx config @region --set REGION

Your subscription authentication is now stored for future SPX requests. If you need to remove either of these stored values, run spx config @region --clear or spx config @key --clear.

Synthesize speech to a speaker

Now you're ready to run the Speech CLI to synthesize speech from text. From the command line, change to the directory that contains the Speech CLI binary file. Then run the following command.

spx synthesize --text "The speech synthesizer greets you!"

The Speech CLI will produce natural language in English through the computer speaker.

Synthesize speech to a file

Run the following command to change the output from your speaker to a .wav file.

spx synthesize --text "The speech synthesizer greets you!" --audio output greetings.wav

The Speech CLI will produce natural language in English into the greetings.wav audio file. In Windows, you can play the audio file by entering start greetings.wav.

Next steps