Quickstart: Create captions with speech to text

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK in the next section of this article, but first check the SDK installation guide for any more requirements.

You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to create a new console application and install the Speech SDK.

  1. Open a command prompt where you want the new project, and create a console application with the .NET CLI.

    dotnet new console
    
  2. Install the Speech SDK in your new project with the .NET CLI.

    dotnet add package Microsoft.CognitiveServices.Speech
    
  3. Copy the scenarios/csharp/dotnetcore/captioning/ sample files from GitHub into your project directory. Overwrite the local copy of Program.cs with the file that you copy from GitHub.

  4. Build the project with the .NET CLI.

    dotnet build
    
  5. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    dotnet run --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

    The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    

Usage and arguments

Usage: captioning --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.
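For illustration, the only difference between the SRT and WebVTT timestamps that the sample emits is the fractional-seconds separator: SRT uses a comma, WebVTT uses a period. A minimal sketch of that formatting (the `format_timestamp` helper is illustrative, not part of the sample):

```python
def format_timestamp(seconds: float, srt: bool = False) -> str:
    """Format a time in seconds as an SRT or WebVTT caption timestamp.

    SRT uses a comma before the milliseconds (00:00:01,600);
    WebVTT uses a period (00:00:01.600).
    """
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = round((seconds - int(seconds)) * 1000)
    separator = "," if srt else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"

print(format_timestamp(1.6, srt=True))   # 00:00:01,600 (SRT style)
print(format_timestamp(1.6))             # 00:00:01.600 (WebVTT style)
```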

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK in the next section of this article, but first check the SDK installation guide for any more requirements.

You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to create a new console application and install the Speech SDK.

  1. Download or copy the scenarios/cpp/windows/captioning/ sample files from GitHub into a local directory.

  2. Open the captioning.sln solution file in Visual Studio.

  3. Install the Speech SDK in your project with the NuGet package manager.

    Install-Package Microsoft.CognitiveServices.Speech
    
  4. Open Project > Properties > General. Set Configuration to All configurations. Set C++ Language Standard to ISO C++17 Standard (/std:c++17).

  5. Open Build > Configuration Manager.

    • On a 64-bit Windows installation, set Active solution platform to x64.
    • On a 32-bit Windows installation, set Active solution platform to x86.
  6. Open Project > Properties > Debugging. Enter your preferred command line arguments at Command Arguments. See usage and arguments for the available options. Here is an example:

    --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

  7. Build and run the console application. The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    

Usage and arguments

Usage: captioning --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (Go) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

Check whether there are any platform-specific installation steps.

You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to create a new Go module and install the Speech SDK.

  1. Download or copy the scenarios/go/captioning/ sample files from GitHub into a local directory.

  2. Open a command prompt in the same directory as captioning.go.

  3. Run the following commands to create a go.mod file that links to the Speech SDK components hosted on GitHub:

    go mod init captioning
    go get github.com/Microsoft/cognitive-services-speech-sdk-go
    
  4. Build the Go module.

    go build
    
  5. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    go run captioning.go helper.go --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

    The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    

Usage and arguments

Usage: go run captioning.go helper.go --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
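The behavior described above, where every event is echoed to the console but only final Recognized results reach the output file, can be sketched without the Speech SDK. The event names match the bullet above; the `handle_events` helper and the (event_type, caption) pairs are illustrative, not the sample's actual handlers:

```python
import io

def handle_events(events, output, show_recognizing=True):
    """Echo caption events to the console; write only final results to output.

    events: iterable of (event_type, caption) pairs, where event_type is
    "Recognizing" (a partial result) or "Recognized" (a final result).
    """
    for event_type, caption in events:
        if event_type == "Recognized":
            print(caption)                 # final results go to the console...
            output.write(caption + "\n")   # ...and to the output file
        elif show_recognizing:
            print(caption)                 # partial results: console only

events = [
    ("Recognizing", "Welcome to"),
    ("Recognizing", "Welcome to applied"),
    ("Recognized", "Welcome to applied Mathematics course 201."),
]
out = io.StringIO()
handle_events(events, out)
print(out.getvalue())  # only the final caption was written
```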

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

Before you can do anything, you need to install the Speech SDK. The sample in this quickstart works with the Java Runtime.

  1. Install Apache Maven
  2. Create a new pom.xml file in the root of your project, and copy the following into it:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.microsoft.cognitiveservices.speech.samples</groupId>
        <artifactId>quickstart-eclipse</artifactId>
        <version>1.0.0-SNAPSHOT</version>
        <build>
            <sourceDirectory>src</sourceDirectory>
            <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                <source>1.8</source>
                <target>1.8</target>
                </configuration>
            </plugin>
            </plugins>
        </build>
        <repositories>
            <repository>
            <id>maven-cognitiveservices-speech</id>
            <name>Microsoft Cognitive Services Speech Maven Repository</name>
            <url>https://azureai.azureedge.net/maven/</url>
            </repository>
        </repositories>
        <dependencies>
            <dependency>
            <groupId>com.microsoft.cognitiveservices.speech</groupId>
            <artifactId>client-sdk</artifactId>
            <version>1.23.0</version>
            </dependency>
        </dependencies>
    </project>
    
  3. Install the Speech SDK and dependencies.
    mvn clean dependency:copy-dependencies
    
  4. You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to create a new console application and install the Speech SDK.

  1. Copy the scenarios/java/jre/captioning/ sample files from GitHub into your project directory. The pom.xml file that you created in environment setup must also be in this directory.

  2. Open a command prompt and run this command to compile the project files.

    javac Captioning.java -cp ".;target\dependency\*"
    
  3. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    java -cp ".;target\dependency\*" Captioning --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

    The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    
    

Usage and arguments

Usage: java -cp ".;target\dependency\*" Captioning --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (npm) | Additional Samples on GitHub | Library source code

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

Before you can do anything, you need to install the Speech SDK for JavaScript. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the SDK installation guide.

Create captions from speech

Follow these steps to create a Node.js console application and install the Speech SDK.

  1. Copy the scenarios/javascript/node/captioning/ sample files from GitHub into your project directory.

  2. Open a command prompt in the same directory as Captioning.js.

  3. Install the Speech SDK for JavaScript:

    npm install microsoft-cognitiveservices-speech-sdk
    
  4. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    node captioning.js --key YourSubscriptionKey --region YourServiceRegion --input caption.this.wav --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Note

    The Speech SDK for JavaScript does not support compressed input audio. You must use a WAV file as shown in the example.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

    The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    
    

Usage and arguments

Usage: node captioning.js --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Objective-C does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Objective-C reference and samples linked from the beginning of this article.

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Swift does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Swift reference and samples linked from the beginning of this article.

Reference documentation | Package (PyPi) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

The Speech SDK for Python is available as a Python Package Index (PyPI) module. The Speech SDK for Python is compatible with Windows, Linux, and macOS.

  1. Install a version of Python from 3.7 to 3.10. First check the SDK installation guide for any more requirements.
  2. You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to create a new console application.

  1. Download or copy the scenarios/python/console/captioning/ sample files from GitHub into a local directory.

  2. Open a command prompt in the same directory as captioning.py.

  3. Run this command to install the Speech SDK:

    pip install azure-cognitiveservices-speech
    
  4. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    python captioning.py --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid; change them if necessary.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Cognitive Services security article for more information.

    The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

    00:00:00,180 --> 00:00:01,600
    Welcome to
    
    00:00:00,180 --> 00:00:01,820
    Welcome to applied
    
    00:00:00,180 --> 00:00:02,420
    Welcome to applied mathematics
    
    00:00:00,180 --> 00:00:02,930
    Welcome to applied mathematics course
    
    00:00:00,180 --> 00:00:03,100
    Welcome to applied Mathematics course 2
    
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    

Usage and arguments

Usage: python captioning.py --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases "PHRASE1;PHRASE2": You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value with this code example is 3. For more information, see Get partial results concepts.
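The sample's own argument handling isn't reproduced here, but the options listed above can be sketched with argparse. The flag names follow the usage list; the defaults and which flags are optional are assumptions for illustration:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A sketch of the captioning sample's command line, per the options above."""
    parser = argparse.ArgumentParser(prog="captioning.py")
    parser.add_argument("--key", required=True, help="Speech resource key")
    parser.add_argument("--region", required=True, help="Speech resource region")
    parser.add_argument("--input", help="input audio file (default: microphone)")
    parser.add_argument("--format", default="any",
                        choices=["alaw", "any", "flac", "mp3", "mulaw", "ogg_opus"],
                        help="compressed audio format; valid only with --input")
    parser.add_argument("--output", help="caption output file")
    parser.add_argument("--srt", action="store_true",
                        help="output SRT instead of WebVTT")
    parser.add_argument("--recognizing", action="store_true",
                        help="also output Recognizing event results")
    parser.add_argument("--quiet", action="store_true",
                        help="suppress console output, except errors")
    parser.add_argument("--profanity", choices=["raw", "remove", "mask"])
    parser.add_argument("--threshold", type=int, default=3,
                        help="stable partial result threshold")
    parser.add_argument("--phrases", help="semicolon-separated phrase list")
    return parser

args = build_parser().parse_args(
    "--key k --region westus --input a.mp4 --srt --threshold 5".split())
print(args.srt, args.threshold)  # True 5
```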

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

In this quickstart, you run a console app to create captions with speech to text.

Prerequisites

Set up the environment

Follow these steps and see the Speech CLI quickstart for additional requirements for your platform.

  1. Install the Speech CLI via the .NET CLI by entering this command:

    dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
    
  2. Configure your Speech resource key and region, by running the following commands. Replace SUBSCRIPTION-KEY with your Speech resource key, and replace REGION with your Speech resource region:

    spx config @key --set SUBSCRIPTION-KEY
    spx config @region --set REGION
    

You must also install GStreamer for compressed input audio.

Create captions from speech

With the Speech CLI, you can output both SRT (SubRip Text) and WebVTT (Web Video Text Tracks) captions from any type of media that contains audio.

To recognize audio from a file and output both WebVTT (vtt) and SRT (srt) captions, follow these steps.

  1. Make sure that you have an input file named caption.this.mp4 in the path.

  2. Run the following command to output captions from the video file:

    spx recognize --file caption.this.mp4 --format any --output vtt file - --output srt file - --output each file - @output.each.detailed --property SpeechServiceResponse_StablePartialResultThreshold=5 --profanity masked --phrases "Contoso;Jessie;Rehaan"
    

    The SRT and WebVTT captions are output to the console as shown here:

    1
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    WEBVTT
    
    00:00:00.180 --> 00:00:03.230
    Welcome to applied Mathematics course 201.
    {
      "ResultId": "561a0ea00cc14bb09bd294357df3270f",
      "Duration": "00:00:03.0500000"
    }
    

Usage and arguments

Here are details about the optional arguments from the previous command:

  • --file caption.this.mp4 --format any: Input audio from file. The default input is the microphone. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
  • --output vtt file - and --output srt file -: Outputs WebVTT and SRT captions to standard output. For more information about SRT and WebVTT caption file formats, see Caption output format. For more information about the --output argument, see Speech CLI output options.
  • @output.each.detailed: Outputs event results with text, offset, and duration. For more information, see Get speech recognition results.
  • --property SpeechServiceResponse_StablePartialResultThreshold=5: You can request that the Speech service return fewer Recognizing events that are more accurate. In this example, the Speech service must affirm recognition of a word at least five times before returning the partial results to you. For more information, see Get partial results concepts.
  • --profanity masked: You can specify whether to mask, remove, or show profanity in recognition results. For more information, see Profanity filter concepts.
  • --phrases "Contoso;Jessie;Rehaan": You can specify a list of phrases to be recognized, such as Contoso, Jessie, and Rehaan. For more information, see Improve recognition with phrase list.
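The Duration value in the detailed event output above has seven fractional digits (00:00:03.0500000), which matches timing reported in 100-nanosecond ticks. Assuming tick-based offsets and durations, converting them into the SRT timestamps shown earlier is a small calculation; the `ticks_to_srt` helper is illustrative, not part of the Speech CLI:

```python
def ticks_to_srt(ticks: int) -> str:
    """Convert a 100-nanosecond tick count into an SRT timestamp."""
    millis_total = ticks // 10_000          # 10,000 ticks per millisecond
    seconds_total, millis = divmod(millis_total, 1000)
    hours, rem = divmod(seconds_total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

# 00:00:03.0500000 in the detailed output corresponds to 30,500,000 ticks:
print(ticks_to_srt(30_500_000))  # 00:00:03,050
```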

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Next steps