Quickstart: Recognize speech stored in blob storage

In this quickstart, you use a REST API to recognize speech from files in a batch process. A batch process executes the speech transcription without any user interaction. It gives you a simple programming model, without the need to manage concurrency, custom speech models, or other details, and it offers advanced control options while making efficient use of Azure Speech service resources.

For more information on the available options and configuration details, see batch transcription.

The following quickstart will walk you through a usage sample.

If you prefer to jump right in, view or download all Speech SDK C# Samples on GitHub. Otherwise, let's get started.

Prerequisites

Before you get started, make sure that you have an Azure subscription, a Speech resource, and an audio file uploaded to a container in Azure Blob Storage.

Open your project in Visual Studio

The first step is to make sure that you have your project open in Visual Studio.

  1. Launch Visual Studio 2019.
  2. Load your project and open Program.cs.

Add a reference to Newtonsoft.Json

  1. In the Solution Explorer, right-click the helloworld project, and then select Manage NuGet Packages to show the NuGet Package Manager.
  2. In the upper-right corner, find the Package Source drop-down box, and make sure that nuget.org is selected.
  3. In the upper-left corner, select Browse.
  4. In the search box, type newtonsoft.json and press Enter.
  5. From the search results, select the Newtonsoft.Json package, and then select Install to install the latest stable version.
  6. Accept all agreements and licenses to start the installation. After the package is installed, a confirmation appears in the Package Manager Console window.
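If you prefer a command line, the same package can also be installed from the Package Manager Console (Tools > NuGet Package Manager > Package Manager Console):

Install-Package Newtonsoft.Json

The code later in this quickstart also uses ReadAsAsync and JsonMediaTypeFormatter from the System.Net.Http.Formatting namespace; depending on your project type, you may need to add the Microsoft.AspNet.WebApi.Client package the same way to get those types.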

Start with some boilerplate code

Let's add some code that works as a skeleton for our project.

using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading.Tasks;
using System.Net.Http;
using System.Net.Http.Formatting;

class Program
{
    // Replace with your subscription key
    const string SubscriptionKey = "YourSubscriptionKey";

    // Update with your service region
    const string Region = "YourServiceRegion";
    const int Port = 443;
 
    // Recordings and locale
    const string Locale = "en-US";
    const string RecordingsBlobUri = "YourFileUrl";
 
    // Name and description
    const string Name = "Simple transcription";
    const string Description = "Simple transcription description";
 
    const string SpeechToTextBasePath = "api/speechtotext/v2.0/";
 
    static async Task Main()
    {
        // Cognitive Services follows security best practices.
        // If you experience connectivity issues, see:
        // https://docs.microsoft.com/en-us/dotnet/framework/network-programming/tls
        // The next line is needed on platforms older than Windows 10.
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
 
        await TranscribeAsync();
    }
 
    static async Task TranscribeAsync()
    {
        Console.WriteLine("Starting transcriptions client...");
    }
}

You'll need to replace the following values:

  • YourSubscriptionKey: found on the Keys page of the Azure portal for the Speech resource
  • YourServiceRegion: found on the Overview page of the Azure portal for the Speech resource
  • YourFileUrl: found under the Blob service / Containers page of the Azure portal for the Storage account resource
    • Select the appropriate container
    • Select the desired blob
    • Copy the URL under the Properties page

JSON Wrappers

The REST APIs take requests in JSON format and also return results in JSON. We could interact with them using only strings, but that's not recommended. To make the requests and responses easier to manage, we'll declare a few classes to use for serializing and deserializing the JSON.

Go ahead and put their declarations after TranscribeAsync.

public class ModelIdentity
{
    ModelIdentity(Guid id) => Id = id;

    public Guid Id { get; private set; }

    public static ModelIdentity Create(Guid Id) => new ModelIdentity(Id);
}

public class Transcription
{
    [JsonConstructor]
    Transcription(
        Guid id,
        string name,
        string description,
        string locale,
        DateTime createdDateTime,
        DateTime lastActionDateTime,
        string status,
        Uri recordingsUrl,
        IReadOnlyDictionary<string, string> resultsUrls)
    {
        Id = id;
        Name = name;
        Description = description;
        CreatedDateTime = createdDateTime;
        LastActionDateTime = lastActionDateTime;
        Status = status;
        Locale = locale;
        RecordingsUrl = recordingsUrl;
        ResultsUrls = resultsUrls;
    }

    public string Name { get; set; }

    public string Description { get; set; }

    public string Locale { get; set; }

    public Uri RecordingsUrl { get; set; }

    public IReadOnlyDictionary<string, string> ResultsUrls { get; set; }

    public Guid Id { get; set; }

    public DateTime CreatedDateTime { get; set; }

    public DateTime LastActionDateTime { get; set; }

    public string Status { get; set; }

    public string StatusMessage { get; set; }
}

public class TranscriptionDefinition
{
    TranscriptionDefinition(
        string name,
        string description,
        string locale,
        Uri recordingsUrl,
        IEnumerable<ModelIdentity> models)
    {
        Name = name;
        Description = description;
        RecordingsUrl = recordingsUrl;
        Locale = locale;
        Models = models;
        Properties = new Dictionary<string, string>
        {
            ["PunctuationMode"] = "DictatedAndAutomatic",
            ["ProfanityFilterMode"] = "Masked",
            ["AddWordLevelTimestamps"] = "True"
        };
    }

    public string Name { get; set; }

    public string Description { get; set; }

    public Uri RecordingsUrl { get; set; }

    public string Locale { get; set; }

    public IEnumerable<ModelIdentity> Models { get; set; }

    public IDictionary<string, string> Properties { get; set; }

    public static TranscriptionDefinition Create(
        string name,
        string description,
        string locale,
        Uri recordingsUrl)
        => new TranscriptionDefinition(name, description, locale, recordingsUrl, new ModelIdentity[0]);
}

Create and configure an Http Client

The first thing we'll need is an Http Client that has a correct base URL and authentication set. Insert this code in TranscribeAsync.

var client = new HttpClient
{
    Timeout = TimeSpan.FromMinutes(25),
    BaseAddress = new UriBuilder(Uri.UriSchemeHttps, $"{Region}.cris.ai", Port).Uri,
    DefaultRequestHeaders =
    {
        { "Ocp-Apim-Subscription-Key", SubscriptionKey }
    }
};

Generate a transcription request

Next, we'll generate the transcription request. Add this code to TranscribeAsync.

var transcriptionDefinition =
    TranscriptionDefinition.Create(
        Name,
        Description,
        Locale,
        new Uri(RecordingsBlobUri));

var res = JsonConvert.SerializeObject(transcriptionDefinition);
var sc = new StringContent(res);
sc.Headers.ContentType = JsonMediaTypeFormatter.DefaultMediaType;

Send the request and check its status

Now we post the request to the Speech service and check the initial response code. This response code simply indicates whether the service has received the request. The service returns a URL in the response headers; that's the location where it will store the transcription status.

Uri transcriptionLocation = null;
using (var response = await client.PostAsync($"{SpeechToTextBasePath}Transcriptions/", sc))
{
    if (!response.IsSuccessStatusCode)
    {
        Console.WriteLine("Error {0} starting transcription.", response.StatusCode);
        return;
    }

    transcriptionLocation = response.Headers.Location;
}

Wait for the transcription to complete

Since the service processes the transcription asynchronously, we need to poll for its status every so often. We'll check every 5 seconds.

We can check the status by retrieving the content at the URL we got when we posted the request. When we get the content back, we deserialize it into one of our helper classes to make it easier to interact with.

Here's the polling code with status display for everything except a successful completion; we'll handle that next.

Console.WriteLine($"Created transcription at location {transcriptionLocation}.");
Console.WriteLine("Checking status.");

var completed = false;

// Check for the status of our transcriptions periodically
while (!completed)
{
    Transcription transcription = null;
    using (var response = await client.GetAsync(transcriptionLocation.AbsolutePath))
    {
        var contentType = response.Content.Headers.ContentType;
        if (response.IsSuccessStatusCode &&
            string.Equals(contentType.MediaType, "application/json", StringComparison.OrdinalIgnoreCase))
        {
            transcription = await response.Content.ReadAsAsync<Transcription>();
        }
        else
        {
            Console.WriteLine("Error with status {0} getting transcription result", response.StatusCode);
            continue;
        }
    }

    switch (transcription.Status)
    {
        case "Failed":
            completed = true;
            Console.WriteLine("Transcription failed. Status: {0}", transcription.StatusMessage);
            break;

        case "Succeeded":
            break;

        case "Running":
            Console.WriteLine("Transcription is still running.");
            break;

        case "NotStarted":
            Console.WriteLine("Transcription has not started.");
            break;
    }

    await Task.Delay(TimeSpan.FromSeconds(5));
}

Console.WriteLine("Press any key...");
Console.ReadKey();

Display the transcription results

Once the service has successfully completed the transcription, the results are stored at another URL that we can get from the status response. Here we make a request to download those results into a temporary file before reading and deserializing them. Once the results are loaded, we can print them to the console. Add the following code to the case "Succeeded": label.

completed = true;
var webClient = new WebClient();
var filename = Path.GetTempFileName();
webClient.DownloadFile(transcription.ResultsUrls["channel_0"], filename);
var results = File.ReadAllText(filename);
Console.WriteLine($"Transcription succeeded. Results: {Environment.NewLine}{results}");
File.Delete(filename);

Check your code

At this point, your code should look like this (we've added some comments to this version):

using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading.Tasks;
using System.Net.Http;
using System.Net.Http.Formatting;

namespace BatchClient
{
    class Program
    {
        // Replace with your subscription key
        const string SubscriptionKey = "YourSubscriptionKey";

        // Update with your service region
        const string Region = "YourServiceRegion";
        const int Port = 443;

        // Recordings and locale
        const string Locale = "en-US";
        const string RecordingsBlobUri = "YourFileUrl";

        // Name and description
        const string Name = "Simple transcription";
        const string Description = "Simple transcription description";

        const string SpeechToTextBasePath = "api/speechtotext/v2.0/";

        static async Task Main()
        {
            // For non-Windows 10 users.
            ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

            await TranscribeAsync();
        }

        static async Task TranscribeAsync()
        {
            Console.WriteLine("Starting transcriptions client...");

            // Create the client object and authenticate
            var client = new HttpClient
            {
                Timeout = TimeSpan.FromMinutes(25),
                BaseAddress = new UriBuilder(Uri.UriSchemeHttps, $"{Region}.cris.ai", Port).Uri,
                DefaultRequestHeaders =
                {
                    { "Ocp-Apim-Subscription-Key", SubscriptionKey }
                }
            };

            var transcriptionDefinition =
                TranscriptionDefinition.Create(
                    Name,
                    Description,
                    Locale,
                    new Uri(RecordingsBlobUri));

            var res = JsonConvert.SerializeObject(transcriptionDefinition);
            var sc = new StringContent(res);
            sc.Headers.ContentType = JsonMediaTypeFormatter.DefaultMediaType;

            Uri transcriptionLocation = null;

            using (var response = await client.PostAsync($"{SpeechToTextBasePath}Transcriptions/", sc))
            {
                if (!response.IsSuccessStatusCode)
                {
                    Console.WriteLine("Error {0} starting transcription.", response.StatusCode);
                    return;
                }

                transcriptionLocation = response.Headers.Location;
            }

            Console.WriteLine($"Created transcription at location {transcriptionLocation}.");
            Console.WriteLine("Checking status.");

            var completed = false;

            // Check for the status of our transcriptions periodically
            while (!completed)
            {
                Transcription transcription = null;

                // Get the transcription we created
                using (var response = await client.GetAsync(transcriptionLocation.AbsolutePath))
                {
                    var contentType = response.Content.Headers.ContentType;
                    if (response.IsSuccessStatusCode &&
                        string.Equals(contentType.MediaType, "application/json", StringComparison.OrdinalIgnoreCase))
                    {
                        transcription = await response.Content.ReadAsAsync<Transcription>();
                    }
                    else
                    {
                        Console.WriteLine("Error with status {0} getting transcription result", response.StatusCode);
                        continue;
                    }
                }

                // Check the status of the transcription
                switch (transcription.Status)
                {
                    case "Failed":
                        completed = true;
                        Console.WriteLine("Transcription failed. Status: {0}", transcription.StatusMessage);
                        break;

                    case "Succeeded":
                        completed = true;
                        var webClient = new WebClient();
                        var filename = Path.GetTempFileName();
                        webClient.DownloadFile(transcription.ResultsUrls["channel_0"], filename);
                        var results = File.ReadAllText(filename);
                        Console.WriteLine($"Transcription succeeded. Results: {Environment.NewLine}{results}");
                        File.Delete(filename);
                        break;

                    case "Running":
                        Console.WriteLine("Transcription is still running.");
                        break;

                    case "NotStarted":
                        Console.WriteLine("Transcription has not started.");
                        break;
                }

                await Task.Delay(TimeSpan.FromSeconds(5));
            }

            Console.WriteLine("Press any key...");
            Console.ReadKey();
        }
    }

    public class ModelIdentity
    {
        ModelIdentity(Guid id) => Id = id;

        public Guid Id { get; private set; }

        public static ModelIdentity Create(Guid Id) => new ModelIdentity(Id);
    }

    public class Transcription
    {
        [JsonConstructor]
        Transcription(
            Guid id,
            string name,
            string description,
            string locale,
            DateTime createdDateTime,
            DateTime lastActionDateTime,
            string status,
            Uri recordingsUrl,
            IReadOnlyDictionary<string, string> resultsUrls)
        {
            Id = id;
            Name = name;
            Description = description;
            CreatedDateTime = createdDateTime;
            LastActionDateTime = lastActionDateTime;
            Status = status;
            Locale = locale;
            RecordingsUrl = recordingsUrl;
            ResultsUrls = resultsUrls;
        }

        public string Name { get; set; }

        public string Description { get; set; }

        public string Locale { get; set; }

        public Uri RecordingsUrl { get; set; }

        public IReadOnlyDictionary<string, string> ResultsUrls { get; set; }

        public Guid Id { get; set; }

        public DateTime CreatedDateTime { get; set; }

        public DateTime LastActionDateTime { get; set; }

        public string Status { get; set; }

        public string StatusMessage { get; set; }
    }

    public class TranscriptionDefinition
    {
        TranscriptionDefinition(
            string name,
            string description,
            string locale,
            Uri recordingsUrl,
            IEnumerable<ModelIdentity> models)
        {
            Name = name;
            Description = description;
            RecordingsUrl = recordingsUrl;
            Locale = locale;
            Models = models;
            Properties = new Dictionary<string, string>
            {
                ["PunctuationMode"] = "DictatedAndAutomatic",
                ["ProfanityFilterMode"] = "Masked",
                ["AddWordLevelTimestamps"] = "True"
            };
        }

        public string Name { get; set; }

        public string Description { get; set; }

        public Uri RecordingsUrl { get; set; }

        public string Locale { get; set; }

        public IEnumerable<ModelIdentity> Models { get; set; }

        public IDictionary<string, string> Properties { get; set; }

        public static TranscriptionDefinition Create(
            string name,
            string description,
            string locale,
            Uri recordingsUrl)
            => new TranscriptionDefinition(name, description, locale, recordingsUrl, new ModelIdentity[0]);
    }
}

Build and run your app

Now you're ready to build your app and test speech recognition using the Speech service.

  1. Compile the code - From the menu bar of Visual Studio, choose Build > Build Solution.
  2. Start your app - From the menu bar, choose Debug > Start Debugging or press F5.
  3. Check the output - The app posts the transcription request for your audio file, polls the Speech service for its status, and prints the transcription results to the console.
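Assuming the transcription completes successfully, the console output (built from the messages the code above writes) looks roughly like the following, with your own transcription location and result JSON in place of the placeholders:

Starting transcriptions client...
Created transcription at location https://<YourServiceRegion>.cris.ai/api/speechtotext/v2.0/transcriptions/<transcription-id>.
Checking status.
Transcription has not started.
Transcription is still running.
Transcription is still running.
Transcription succeeded. Results:
<transcription result JSON>
Press any key...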

Next steps

In this quickstart, you use a REST API to recognize speech from files in a batch process. A batch process executes the speech transcription without any user interaction. It gives you a simple programming model, without the need to manage concurrency, custom speech models, or other details, and it offers advanced control options while making efficient use of Azure Speech service resources.

For more information on the available options and configuration details, see batch transcription.

The following quickstart will walk you through a usage sample.

If you prefer to jump right in, view or download all Speech SDK C++ Samples on GitHub. Otherwise, let's get started.

Prerequisites

Before you get started, make sure that you have an Azure subscription, a Speech resource, and an audio file uploaded to a container in Azure Blob Storage.

Open your project in Visual Studio

The first step is to make sure that you have your project open in Visual Studio.

  1. Launch Visual Studio 2019.
  2. Load your project and open helloworld.cpp.

Add references

To speed up our code development, we'll use a couple of external components:

  • C++ REST SDK: a client library for making REST calls to a REST service.
  • nlohmann/json: a handy JSON parsing / serialization / deserialization library.

Both can be installed using vcpkg.

vcpkg install cpprestsdk cpprestsdk:x64-windows
vcpkg install nlohmann-json

Start with some boilerplate code

Let's add some code that works as a skeleton for our project.

#include <iostream>
#include <strstream>
#include <Windows.h>
#include <locale>
#include <codecvt>
#include <string>

#include <cpprest/http_client.h>
#include <cpprest/filestream.h>
#include <nlohmann/json.hpp>

using namespace std;
using namespace utility;                    // Common utilities like string conversions
using namespace web;                        // Common features like URIs.
using namespace web::http;                  // Common HTTP functionality
using namespace web::http::client;          // HTTP client features
using namespace concurrency::streams;       // Asynchronous streams
using json = nlohmann::json;

const string_t region = U("YourServiceRegion");
const string_t subscriptionKey = U("YourSubscriptionKey");
const string name = "Simple transcription";
const string description = "Simple transcription description";
const string myLocale = "en-US";
const string recordingsBlobUri = "YourFileUrl";

void recognizeSpeech()
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >> converter;

}

int wmain()
{
    recognizeSpeech();
    cout << "Please press a key to continue.\n";
    cin.get();
    return 0;
}

You'll need to replace the following values:

  • YourSubscriptionKey: found on the Keys page of the Azure portal for the Speech resource
  • YourServiceRegion: found on the Overview page of the Azure portal for the Speech resource
  • YourFileUrl: found under the Blob service / Containers page of the Azure portal for the Storage account resource
    • Select the appropriate container
    • Select the desired blob
    • Copy the URL under the Properties page

JSON Wrappers

The REST APIs take requests in JSON format and also return results in JSON. We could interact with them using only strings, but that's not recommended. To make the requests and responses easier to manage, we'll declare a few classes to use for serializing and deserializing the JSON, along with some helper methods for nlohmann/json.

Go ahead and put their declarations before recognizeSpeech.

class TranscriptionDefinition {
private:
    TranscriptionDefinition(string name,
        string description,
        string locale,
        string recordingsUrl,
        std::list<string> models) {

        Name = name;
        Description = description;
        RecordingsUrl = recordingsUrl;
        Locale = locale;
        Models = models;
    }

public:
    string Name;
    string Description;
    string RecordingsUrl;
    string Locale;
    std::list<string> Models;
    std::map<string, string> properties;

    static TranscriptionDefinition Create(string name, string description, string locale, string recordingsUrl) {
        return TranscriptionDefinition(name, description, locale, recordingsUrl, std::list<string>());
    }
    static TranscriptionDefinition Create(string name, string description, string locale, string recordingsUrl,
        std::list<string> models) {
        return TranscriptionDefinition(name, description, locale, recordingsUrl, models);
    }
};

void to_json(nlohmann::json& j, const TranscriptionDefinition& t) {
    j = nlohmann::json{
            { "description", t.Description },
            { "locale", t.Locale },
            { "models", t.Models },
            { "name", t.Name },
            { "properties", t.properties },
            { "recordingsurl",t.RecordingsUrl }
    };
};

void from_json(const nlohmann::json& j, TranscriptionDefinition& t) {
    j.at("locale").get_to(t.Locale);
    j.at("models").get_to(t.Models);
    j.at("name").get_to(t.Name);
    j.at("properties").get_to(t.properties);
    j.at("recordingsurl").get_to(t.RecordingsUrl);
}

class Transcription {
public:
    string name;
    string description;
    string locale;
    string recordingsUrl;
    map<string, string> resultsUrls;
    string id;
    string createdDateTime;
    string lastActionDateTime;
    string status;
    string statusMessage;
};

void to_json(nlohmann::json& j, const Transcription& t) {
    j = nlohmann::json{
            { "description", t.description },
            { "locale", t.locale },
            { "createddatetime", t.createdDateTime },
            { "name", t.name },
            { "id", t.id },
            { "recordingsurl",t.recordingsUrl },
            { "resultUrls", t.resultsUrls},
            { "status", t.status},
            { "statusmessage", t.statusMessage}
    };
};

void from_json(const nlohmann::json& j, Transcription& t) {
    j.at("description").get_to(t.description);
    j.at("locale").get_to(t.locale);
    j.at("createdDateTime").get_to(t.createdDateTime);
    j.at("name").get_to(t.name);
    j.at("recordingsUrl").get_to(t.recordingsUrl);
    t.resultsUrls = j.at("resultsUrls").get<map<string, string>>();
    j.at("status").get_to(t.status);
    j.at("statusMessage").get_to(t.statusMessage);
}
class Result
{
public:
    string Lexical;
    string ITN;
    string MaskedITN;
    string Display;
};
void from_json(const nlohmann::json& j, Result& r) {
    j.at("Lexical").get_to(r.Lexical);
    j.at("ITN").get_to(r.ITN);
    j.at("MaskedITN").get_to(r.MaskedITN);
    j.at("Display").get_to(r.Display);
}

class NBest : public Result
{
public:
    double Confidence;	
};
void from_json(const nlohmann::json& j, NBest& nb) {
    j.at("Confidence").get_to(nb.Confidence);
    j.at("Lexical").get_to(nb.Lexical);
    j.at("ITN").get_to(nb.ITN);
    j.at("MaskedITN").get_to(nb.MaskedITN);
    j.at("Display").get_to(nb.Display);
}

class SegmentResult
{
public:
    string RecognitionStatus;
    ULONG Offset;
    ULONG Duration;
    std::list<NBest> NBest;
};
void from_json(const nlohmann::json& j, SegmentResult& sr) {
    j.at("RecognitionStatus").get_to(sr.RecognitionStatus);
    j.at("Offset").get_to(sr.Offset);
    j.at("Duration").get_to(sr.Duration);
    sr.NBest = j.at("NBest").get<list<NBest>>();
}

class AudioFileResult
{
public:
    string AudioFileName;
    std::list<SegmentResult> SegmentResults;
    std::list<Result> CombinedResults;
};
void from_json(const nlohmann::json& j, AudioFileResult& arf) {
    j.at("AudioFileName").get_to(arf.AudioFileName);
    arf.SegmentResults = j.at("SegmentResults").get<list<SegmentResult>>();
    arf.CombinedResults = j.at("CombinedResults").get<list<Result>>();
}

class RootObject {
public:
    std::list<AudioFileResult> AudioFileResults;
};
void from_json(const nlohmann::json& j, RootObject& r) {
    r.AudioFileResults = j.at("AudioFileResults").get<list<AudioFileResult>>();
}

Create and configure an Http Client

The first thing we'll need is an Http Client that has the correct base URL and authentication set. Insert this code in recognizeSpeech.

utility::string_t service_url = U("https://") + region + U(".cris.ai/api/speechtotext/v2.0/Transcriptions/");
uri u(service_url);

http_client c(u);
http_request msg(methods::POST);
msg.headers().add(U("Content-Type"), U("application/json"));
msg.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

Generate a transcription request

Next, we'll generate the transcription request. Add this code to recognizeSpeech.

auto transportdef = TranscriptionDefinition::Create(name, description, myLocale, recordingsBlobUri);

nlohmann::json transportdefJSON = transportdef;

msg.set_body(transportdefJSON.dump());

Send the request and check its status

Now we post the request to the Speech service and check the initial response code. This response code simply indicates whether the service has received the request. The service returns a URL in the response headers; that's the location where it will store the transcription status.


auto response = c.request(msg).get();
auto statusCode = response.status_code();

if (statusCode != status_codes::Accepted)
{
    cout << "Unexpected status code " << statusCode << endl;
    return;
}

string_t transcriptionLocation = response.headers()[U("location")];

cout << "Transcription status is located at " << converter.to_bytes(transcriptionLocation) << endl;

Wait for the transcription to complete

Since the service processes the transcription asynchronously, we need to poll for its status every so often. We'll check every 5 seconds.

We can check the status by retrieving the content at the URL we got when we posted the request. When we get the content back, we deserialize it into one of our helper classes to make it easier to interact with.

Here's the polling code with status display for everything except a successful completion; we'll handle that next.

http_client statusCheckClient(transcriptionLocation);
http_request statusCheckMessage(methods::GET);
statusCheckMessage.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

bool completed = false;

while (!completed)
{
    auto statusResponse = statusCheckClient.request(statusCheckMessage).get();
    auto statusResponseCode = statusResponse.status_code();

    if (statusResponseCode != status_codes::OK)
    {
        cout << "Fetching the transcription returned unexpected http code " << statusResponseCode << endl;
        return;
    }

    auto body = statusResponse.extract_string().get();
    nlohmann::json statusJSON = nlohmann::json::parse(body);
    Transcription transcriptonStatus = statusJSON;

    if (!_stricmp(transcriptonStatus.status.c_str(), "Failed"))
    {
        completed = true;
        cout << "Transcription has failed " << transcriptonStatus.statusMessage << endl;
    }
    else if (!_stricmp(transcriptonStatus.status.c_str(), "Succeeded"))
    {
    }
    else if (!_stricmp(transcriptonStatus.status.c_str(), "Running"))
    {
        cout << "Transcription is running." << endl;
    }
    else if (!_stricmp(transcriptonStatus.status.c_str(), "NotStarted"))
    {
        cout << "Transcription has not started." << endl;
    }

    if (!completed) {
        Sleep(5000);
    }

}

Display the transcription results

Once the service has successfully completed the transcription, the results are stored at another URL that we can get from the status response.

We'll download the contents of that URL, deserialize the JSON, and loop through the results, printing out the display text as we go. Add this code to the Succeeded branch of the status check.

completed = true;
cout << "Success!" << endl;
string result = transcriptonStatus.resultsUrls["channel_0"];
cout << "Transcription has completed. Results are at " << result << endl;
cout << "Fetching results" << endl;

http_client result_client(converter.from_bytes(result));
http_request resultMessage(methods::GET);
resultMessage.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

auto resultResponse = result_client.request(resultMessage).get();

auto responseCode = resultResponse.status_code();

if (responseCode != status_codes::OK)
{
    cout << "Fetching the transcription returned unexpected http code " << responseCode << endl;
    return;
}

auto resultBody = resultResponse.extract_string().get();

nlohmann::json resultJSON = nlohmann::json::parse(resultBody);
RootObject root = resultJSON;

for (AudioFileResult af : root.AudioFileResults)
{
    cout << "There were " << af.SegmentResults.size() << " results in " << af.AudioFileName << endl;
    
    for (SegmentResult segResult : af.SegmentResults)
    {
        cout << "Status: " << segResult.RecognitionStatus << endl;

        if (!_stricmp(segResult.RecognitionStatus.c_str(), "success") && segResult.NBest.size() > 0)
        {
            cout << "Best text result was: '" << segResult.NBest.front().Display << "'" << endl;
        }
    }
}

Check your code

At this point, your code should look like this (we've added some comments to this version):

#include <iostream>
#include <strstream>
#include <Windows.h>
#include <locale>
#include <codecvt>
#include <string>

#include <cpprest/http_client.h>
#include <cpprest/filestream.h>
#include <nlohmann/json.hpp>

using namespace std;
using namespace utility;                    // Common utilities like string conversions
using namespace web;                        // Common features like URIs.
using namespace web::http;                  // Common HTTP functionality
using namespace web::http::client;          // HTTP client features
using namespace concurrency::streams;       // Asynchronous streams
using json = nlohmann::json;

const string_t region = U("YourServiceRegion");
const string_t subscriptionKey = U("YourSubscriptionKey");
const string name = "Simple transcription";
const string description = "Simple transcription description";
const string myLocale = "en-US";
const string recordingsBlobUri = "YourFileUrl";

class TranscriptionDefinition {
private:
    TranscriptionDefinition(string name,
        string description,
        string locale,
        string recordingsUrl,
        std::list<string> models) {

        Name = name;
        Description = description;
        RecordingsUrl = recordingsUrl;
        Locale = locale;
        Models = models;
    }

public:
    string Name;
    string Description;
    string RecordingsUrl;
    string Locale;
    std::list<string> Models;
    std::map<string, string> properties;

    static TranscriptionDefinition Create(string name, string description, string locale, string recordingsUrl) {
        return TranscriptionDefinition(name, description, locale, recordingsUrl, std::list<string>());
    }
    static TranscriptionDefinition Create(string name, string description, string locale, string recordingsUrl,
        std::list<string> models) {
        return TranscriptionDefinition(name, description, locale, recordingsUrl, models);
    }
};

void to_json(nlohmann::json& j, const TranscriptionDefinition& t) {
    j = nlohmann::json{
            { "description", t.Description },
            { "locale", t.Locale },
            { "models", t.Models },
            { "name", t.Name },
            { "properties", t.properties },
            { "recordingsurl",t.RecordingsUrl }
    };
};

void from_json(const nlohmann::json& j, TranscriptionDefinition& t) {
    j.at("locale").get_to(t.Locale);
    j.at("models").get_to(t.Models);
    j.at("name").get_to(t.Name);
    j.at("properties").get_to(t.properties);
    j.at("recordingsurl").get_to(t.RecordingsUrl);
}

class Transcription {
public:
    string name;
    string description;
    string locale;
    string recordingsUrl;
    map<string, string> resultsUrls;
    string id;
    string createdDateTime;
    string lastActionDateTime;
    string status;
    string statusMessage;
};

void to_json(nlohmann::json& j, const Transcription& t) {
    j = nlohmann::json{
            { "description", t.description },
            { "locale", t.locale },
            { "createddatetime", t.createdDateTime },
            { "name", t.name },
            { "id", t.id },
            { "recordingsurl",t.recordingsUrl },
            { "resultUrls", t.resultsUrls},
            { "status", t.status},
            { "statusmessage", t.statusMessage}
    };
};

void from_json(const nlohmann::json& j, Transcription& t) {
    j.at("description").get_to(t.description);
    j.at("locale").get_to(t.locale);
    j.at("createdDateTime").get_to(t.createdDateTime);
    j.at("name").get_to(t.name);
    j.at("recordingsUrl").get_to(t.recordingsUrl);
    t.resultsUrls = j.at("resultsUrls").get<map<string, string>>();
    j.at("status").get_to(t.status);
    j.at("statusMessage").get_to(t.statusMessage);
}
class Result
{
public:
    string Lexical;
    string ITN;
    string MaskedITN;
    string Display;
};
void from_json(const nlohmann::json& j, Result& r) {
    j.at("Lexical").get_to(r.Lexical);
    j.at("ITN").get_to(r.ITN);
    j.at("MaskedITN").get_to(r.MaskedITN);
    j.at("Display").get_to(r.Display);
}

class NBest : public Result
{
public:
    double Confidence;	
};
void from_json(const nlohmann::json& j, NBest& nb) {
    j.at("Confidence").get_to(nb.Confidence);
    j.at("Lexical").get_to(nb.Lexical);
    j.at("ITN").get_to(nb.ITN);
    j.at("MaskedITN").get_to(nb.MaskedITN);
    j.at("Display").get_to(nb.Display);
}

class SegmentResult
{
public:
    string RecognitionStatus;
    ULONG Offset;
    ULONG Duration;
    std::list<NBest> NBest;
};
void from_json(const nlohmann::json& j, SegmentResult& sr) {
    j.at("RecognitionStatus").get_to(sr.RecognitionStatus);
    j.at("Offset").get_to(sr.Offset);
    j.at("Duration").get_to(sr.Duration);
    sr.NBest = j.at("NBest").get<list<NBest>>();
}

class AudioFileResult
{
public:
    string AudioFileName;
    std::list<SegmentResult> SegmentResults;
    std::list<Result> CombinedResults;
};
void from_json(const nlohmann::json& j, AudioFileResult& arf) {
    j.at("AudioFileName").get_to(arf.AudioFileName);
    arf.SegmentResults = j.at("SegmentResults").get<list<SegmentResult>>();
    arf.CombinedResults = j.at("CombinedResults").get<list<Result>>();
}

class RootObject {
public:
    std::list<AudioFileResult> AudioFileResults;
};
void from_json(const nlohmann::json& j, RootObject& r) {
    r.AudioFileResults = j.at("AudioFileResults").get<list<AudioFileResult>>();
}


void recognizeSpeech()
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >> converter;

    utility::string_t service_url = U("https://") + region + U(".cris.ai/api/speechtotext/v2.0/Transcriptions/");
    uri u(service_url);

    http_client c(u);
    http_request msg(methods::POST);
    msg.headers().add(U("Content-Type"), U("application/json"));
    msg.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

    auto transportdef = TranscriptionDefinition::Create(name, description, myLocale, recordingsBlobUri);

    nlohmann::json transportdefJSON = transportdef;

    msg.set_body(transportdefJSON.dump());

    auto response = c.request(msg).get();
    auto statusCode = response.status_code();

    if (statusCode != status_codes::Accepted)
    {
        cout << "Unexpected status code " << statusCode << endl;
        return;
    }

    string_t transcriptionLocation = response.headers()[U("location")];

    cout << "Transcription status is located at " << converter.to_bytes(transcriptionLocation) << endl;

    http_client statusCheckClient(transcriptionLocation);
    http_request statusCheckMessage(methods::GET);
    statusCheckMessage.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

    bool completed = false;

    while (!completed)
    {
        auto statusResponse = statusCheckClient.request(statusCheckMessage).get();
        auto statusResponseCode = statusResponse.status_code();

        if (statusResponseCode != status_codes::OK)
        {
            cout << "Fetching the transcription returned unexpected http code " << statusResponseCode << endl;
            return;
        }

        auto body = statusResponse.extract_string().get();
        nlohmann::json statusJSON = nlohmann::json::parse(body);
        Transcription transcriptonStatus = statusJSON;

        if (!_stricmp(transcriptonStatus.status.c_str(), "Failed"))
        {
            completed = true;
            cout << "Transcription has failed " << transcriptonStatus.statusMessage << endl;
        }
        else if (!_stricmp(transcriptonStatus.status.c_str(), "Succeeded"))
        {
            completed = true;
            cout << "Success!" << endl;
            string result = transcriptonStatus.resultsUrls["channel_0"];
            cout << "Transcription has completed. Results are at " << result << endl;
            cout << "Fetching results" << endl;

            http_client result_client(converter.from_bytes(result));
            http_request resultMessage(methods::GET);
            resultMessage.headers().add(U("Ocp-Apim-Subscription-Key"), subscriptionKey);

            auto resultResponse = result_client.request(resultMessage).get();

            auto responseCode = resultResponse.status_code();

            if (responseCode != status_codes::OK)
            {
                cout << "Fetching the transcription returned unexpected http code " << responseCode << endl;
                return;
            }

            auto resultBody = resultResponse.extract_string().get();
            
            nlohmann::json resultJSON = nlohmann::json::parse(resultBody);
            RootObject root = resultJSON;

            for (AudioFileResult af : root.AudioFileResults)
            {
                cout << "There were " << af.SegmentResults.size() << " results in " << af.AudioFileName << endl;
                
                for (SegmentResult segResult : af.SegmentResults)
                {
                    cout << "Status: " << segResult.RecognitionStatus << endl;

                    if (!_stricmp(segResult.RecognitionStatus.c_str(), "success") && segResult.NBest.size() > 0)
                    {
                        cout << "Best text result was: '" << segResult.NBest.front().Display << "'" << endl;
                    }
                }
            }
        }
        else if (!_stricmp(transcriptonStatus.status.c_str(), "Running"))
        {
            cout << "Transcription is running." << endl;
        }
        else if (!_stricmp(transcriptonStatus.status.c_str(), "NotStarted"))
        {
            cout << "Transcription has not started." << endl;
        }

        if (!completed) {
            Sleep(5000);
        }

    }
}

int wmain()
{
    recognizeSpeech();
    cout << "Please press a key to continue.\n";
    cin.get();
    return 0;
}

Build and run your app

Now you're ready to build your app and test speech recognition using the Speech service.

Next steps


In this quickstart, you use a REST API to recognize speech from files in a batch process. A batch process executes the speech transcription without any user interaction. It gives you a simple programming model, without the need to manage concurrency, custom speech models, or other details, and it offers advanced control options while making efficient use of Azure Speech service resources.

For more information on the available options and configuration details, see batch transcription.

The following quickstart will walk you through a usage sample.

If you prefer to jump right in, view or download all Speech SDK Java Samples on GitHub. Otherwise, let's get started.

Prerequisites

Before you get started, make sure that you have an Azure subscription, a Speech resource, and an audio file uploaded to a container in Azure Blob Storage.

Open your project in Eclipse

The first step is to make sure that you have your project open in Eclipse.

  1. Launch Eclipse
  2. Load your project and open Main.java.

Add a reference to Gson

We'll be using an external JSON serializer / deserializer in this quickstart. For Java we've chosen Gson.

Open your pom.xml and add the following reference.

<dependencies>
  <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.8.6</version>
  </dependency>
</dependencies>

Start with some boilerplate code

Let's add some code that works as a skeleton for our project.

package quickstart;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;
import java.util.Dictionary;
import java.util.Hashtable;
import java.util.UUID;

import com.google.gson.Gson;
public class Main {

    private static String region = "YourServiceRegion";
    private static String subscriptionKey = "YourSubscriptionKey";
    private static String Locale = "en-US";
    private static String RecordingsBlobUri = "YourFileUrl";
    private static String Name = "Simple transcription";
    private static String Description = "Simple transcription description";

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.println("Starting transcriptions client...");
    }
}

You'll need to replace the following values:

  • YourSubscriptionKey: found on the Keys page of the Azure portal for the Speech resource
  • YourServiceRegion: found on the Overview page of the Azure portal for the Speech resource
  • YourFileUrl: found under the Blob service / Containers page of the Azure portal for the Storage account resource
    • Select the appropriate container
    • Select the desired blob
    • Copy the URL under the Properties page

JSON Wrappers

The REST APIs take requests in JSON format and also return results in JSON. We could interact with them using only strings, but that's not recommended. To make the requests and responses easier to manage, we'll declare a few classes to use for serializing and deserializing the JSON.

Go ahead and put their declarations before Main.

final class Transcription {
    public String name;
    public String description;
    public String locale;
    public URL recordingsUrl;
    public Hashtable<String, String> resultsUrls;
    public UUID id;
    public Date createdDateTime;
    public Date lastActionDateTime;
    public String status;
    public String statusMessage;
}

final class TranscriptionDefinition {
    private TranscriptionDefinition(String name, String description, String locale, URL recordingsUrl,
            ModelIdentity[] models) {
        this.Name = name;
        this.Description = description;
        this.RecordingsUrl = recordingsUrl;
        this.Locale = locale;
        this.Models = models;
        this.properties = new Hashtable<String, String>();
        this.properties.put("PunctuationMode", "DictatedAndAutomatic");
        this.properties.put("ProfanityFilterMode", "Masked");
        this.properties.put("AddWordLevelTimestamps", "True");
    }

    public String Name;
    public String Description;
    public URL RecordingsUrl;
    public String Locale;
    public ModelIdentity[] Models;
    public Dictionary<String, String> properties;

    public static TranscriptionDefinition Create(String name, String description, String locale, URL recordingsUrl) {
        return TranscriptionDefinition.Create(name, description, locale, recordingsUrl, new ModelIdentity[0]);
    }

    public static TranscriptionDefinition Create(String name, String description, String locale, URL recordingsUrl,
            ModelIdentity[] models) {
        return new TranscriptionDefinition(name, description, locale, recordingsUrl, models);
    }
}

final class ModelIdentity {
    private ModelIdentity(UUID id) {
        this.Id = id;
    }

    public UUID Id;

    public static ModelIdentity Create(UUID Id) {
        return new ModelIdentity(Id);
    }
}

class AudioFileResult {
    public String AudioFileName;
    public SegmentResult[] SegmentResults;
}

class RootObject {
    public AudioFileResult[] AudioFileResults;
}

class NBest {
    public double Confidence;
    public String Lexical;
    public String ITN;
    public String MaskedITN;
    public String Display;
}

class SegmentResult {
    public String RecognitionStatus;
    public String Offset;
    public String Duration;
    public NBest[] NBest;
}

Create and configure an Http Client

The first thing we'll need is an Http Client that has the correct base URL and authentication set. Insert this code in Main.

String url = "https://" + region + ".cris.ai/api/speechtotext/v2.0/Transcriptions/";
URL serviceUrl = new URL(url);

HttpURLConnection postConnection = (HttpURLConnection) serviceUrl.openConnection();
postConnection.setDoOutput(true);
postConnection.setRequestMethod("POST");
postConnection.setRequestProperty("Content-Type", "application/json");
postConnection.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);

Generate a transcription request

Next, we'll generate the transcription request. Add this code to Main.

TranscriptionDefinition definition = TranscriptionDefinition.Create(Name, Description, Locale,
        new URL(RecordingsBlobUri));

Send the request and check its status

Now we post the request to the Speech service and check the initial response code. This response code simply indicates whether the service has received the request. The service returns a URL in the response headers; that's the location where it will store the transcription status.

Gson gson = new Gson();

OutputStream stream = postConnection.getOutputStream();
stream.write(gson.toJson(definition).getBytes());
stream.flush();

int statusCode = postConnection.getResponseCode();

if (statusCode != HttpURLConnection.HTTP_ACCEPTED) {
    System.out.println("Unexpected status code " + statusCode);
    return;
}

Wait for the transcription to complete

Since the service processes the transcription asynchronously, we need to poll for its status every so often. We'll check every 5 seconds.

We can check the status by retrieving the content at the URL we got when we posted the request. When we get the content back, we deserialize it into one of our helper classes to make it easier to interact with.

Here's the polling code with status display for everything except a successful completion; we'll handle that next.

String transcriptionLocation = postConnection.getHeaderField("location");

System.out.println("Transcription is located at " + transcriptionLocation);

URL transcriptionUrl = new URL(transcriptionLocation);

boolean completed = false;
while (!completed) {
    {
        HttpURLConnection getConnection = (HttpURLConnection) transcriptionUrl.openConnection();
        getConnection.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);
        getConnection.setRequestMethod("GET");

        int responseCode = getConnection.getResponseCode();

        if (responseCode != HttpURLConnection.HTTP_OK) {
            System.out.println("Fetching the transcription returned unexpected http code " + responseCode);
            return;
        }

        Transcription t = gson.fromJson(new InputStreamReader(getConnection.getInputStream()),
                Transcription.class);

        switch (t.status) {
        case "Failed":
            completed = true;
            System.out.println("Transcription has failed " + t.statusMessage);
            break;
        case "Succeeded":
            break;
        case "Running":
            System.out.println("Transcription is running.");
            break;
        case "NotStarted":
            System.out.println("Transcription has not started.");
            break;
        }

        if (!completed) {
            Thread.sleep(5000);
        }
    }
}

Display the transcription results

Once the service has successfully completed the transcription, the results are stored at another URL that we can get from the status response.

We'll download the contents of that URL, deserialize the JSON, and loop through the results, printing out the display text as we go. This handling goes in the Succeeded case; the complete program, with it in place, looks like this:

package quickstart;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;
import java.util.Dictionary;
import java.util.Hashtable;
import java.util.UUID;

import com.google.gson.Gson;

final class Transcription {
    public String name;
    public String description;
    public String locale;
    public URL recordingsUrl;
    public Hashtable<String, String> resultsUrls;
    public UUID id;
    public Date createdDateTime;
    public Date lastActionDateTime;
    public String status;
    public String statusMessage;
}

final class TranscriptionDefinition {
    private TranscriptionDefinition(String name, String description, String locale, URL recordingsUrl,
            ModelIdentity[] models) {
        this.Name = name;
        this.Description = description;
        this.RecordingsUrl = recordingsUrl;
        this.Locale = locale;
        this.Models = models;
        this.properties = new Hashtable<String, String>();
        this.properties.put("PunctuationMode", "DictatedAndAutomatic");
        this.properties.put("ProfanityFilterMode", "Masked");
        this.properties.put("AddWordLevelTimestamps", "True");
    }

    public String Name;
    public String Description;
    public URL RecordingsUrl;
    public String Locale;
    public ModelIdentity[] Models;
    public Dictionary<String, String> properties;

    public static TranscriptionDefinition Create(String name, String description, String locale, URL recordingsUrl) {
        return TranscriptionDefinition.Create(name, description, locale, recordingsUrl, new ModelIdentity[0]);
    }

    public static TranscriptionDefinition Create(String name, String description, String locale, URL recordingsUrl,
            ModelIdentity[] models) {
        return new TranscriptionDefinition(name, description, locale, recordingsUrl, models);
    }
}

final class ModelIdentity {
    private ModelIdentity(UUID id) {
        this.Id = id;
    }

    public UUID Id;

    public static ModelIdentity Create(UUID Id) {
        return new ModelIdentity(Id);
    }
}

class AudioFileResult {
    public String AudioFileName;
    public SegmentResult[] SegmentResults;
}

class RootObject {
    public AudioFileResult[] AudioFileResults;
}

class NBest {
    public double Confidence;
    public String Lexical;
    public String ITN;
    public String MaskedITN;
    public String Display;
}

class SegmentResult {
    public String RecognitionStatus;
    public String Offset;
    public String Duration;
    public NBest[] NBest;
}

public class Main {

    private static String region = "YourServiceRegion";
    private static String subscriptionKey = "YourSubscriptionKey";
    private static String Locale = "en-US";
    private static String RecordingsBlobUri = "YourFileUrl";
    private static String Name = "Simple transcription";
    private static String Description = "Simple transcription description";

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.println("Starting transcriptions client...");
        String url = "https://" + region + ".cris.ai/api/speechtotext/v2.0/Transcriptions/";
        URL serviceUrl = new URL(url);

        HttpURLConnection postConnection = (HttpURLConnection) serviceUrl.openConnection();
        postConnection.setDoOutput(true);
        postConnection.setRequestMethod("POST");
        postConnection.setRequestProperty("Content-Type", "application/json");
        postConnection.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);

        TranscriptionDefinition definition = TranscriptionDefinition.Create(Name, Description, Locale,
                new URL(RecordingsBlobUri));

        Gson gson = new Gson();

        OutputStream stream = postConnection.getOutputStream();
        stream.write(gson.toJson(definition).getBytes());
        stream.flush();

        int statusCode = postConnection.getResponseCode();

        if (statusCode != HttpURLConnection.HTTP_ACCEPTED) {
            System.out.println("Unexpected status code " + statusCode);
            return;
        }

        String transcriptionLocation = postConnection.getHeaderField("location");

        System.out.println("Transcription is located at " + transcriptionLocation);

        URL transcriptionUrl = new URL(transcriptionLocation);

        boolean completed = false;
        while (!completed) {
            {
                HttpURLConnection getConnection = (HttpURLConnection) transcriptionUrl.openConnection();
                getConnection.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);
                getConnection.setRequestMethod("GET");

                int responseCode = getConnection.getResponseCode();

                if (responseCode != HttpURLConnection.HTTP_OK) {
                    System.out.println("Fetching the transcription returned unexpected http code " + responseCode);
                    return;
                }

                Transcription t = gson.fromJson(new InputStreamReader(getConnection.getInputStream()),
                        Transcription.class);

                switch (t.status) {
                case "Failed":
                    completed = true;
                    System.out.println("Transcription has failed " + t.statusMessage);
                    break;
                case "Succeeded":
                    completed = true;
                    String result = t.resultsUrls.get("channel_0");
                    System.out.println("Transcription has completed. Results are at " + result);

                    System.out.println("Fetching results");
                    URL resultUrl = new URL(result);

                    HttpURLConnection resultConnection = (HttpURLConnection) resultUrl.openConnection();
                    resultConnection.setRequestProperty("Ocp-Apim-Subscription-Key", subscriptionKey);
                    resultConnection.setRequestMethod("GET");

                    responseCode = resultConnection.getResponseCode();

                    if (responseCode != HttpURLConnection.HTTP_OK) {
                        System.out.println("Fetching the transcription returned unexpected http code " + responseCode);
                        return;
                    }

                    RootObject root = gson.fromJson(new InputStreamReader(resultConnection.getInputStream()),
                            RootObject.class);

                    for (AudioFileResult af : root.AudioFileResults) {
                        System.out
                                .println("There were " + af.SegmentResults.length + " results in " + af.AudioFileName);
                        for (SegmentResult segResult : af.SegmentResults) {
                            System.out.println("Status: " + segResult.RecognitionStatus);
                            if (segResult.RecognitionStatus.equalsIgnoreCase("success") && segResult.NBest.length > 0) {
                                System.out.println("Best text result was: '" + segResult.NBest[0].Display + "'");
                            }
                        }
                    }

                    break;
                case "Running":
                    System.out.println("Transcription is running.");
                    break;
                case "NotStarted":
                    System.out.println("Transcription has not started.");
                    break;
                }

                if (!completed) {
                    Thread.sleep(5000);
                }
            }
        }
    }
}

Check your code

At this point, your code should look like this: (We've added some comments to this version) [!code-java[](~/samples-cognitive-services-speech-sdk/quickstart/java/jre/from-blob/src/quickstart/Main.java)]

Build and run your app

Now you're ready to build your app and test your speech recognition using the Speech service.

Next steps

In this quickstart, you will use a REST API to recognize speech from files in a batch process. A batch process executes the speech transcription without any user interactions. It gives you a simple programming model, without the need to manage concurrency, custom speech models, or other details. It entails advanced control options, while making efficient use of Azure speech service resources.

For more information on the available options and configuration details, see batch transcription.

The following quickstart will walk you through a usage sample.

If you prefer to jump right in, view or download all Speech SDK Python Samples on GitHub. Otherwise, let's get started.

Prerequisites

Before you get started, make sure to:

Download and install the API client library

To execute the sample, you need to generate a Python client library for the REST API from its Swagger description.

Follow these steps for the installation:

  1. Go to https://editor.swagger.io.
  2. Click File, then click Import URL.
  3. Enter the Swagger URL including the region for your Speech service subscription: https://<your-region>.cris.ai/docs/v2.0/swagger.
  4. Click Generate Client and select Python.
  5. Save the client library.
  6. Extract the downloaded python-client-generated.zip somewhere in your file system.
  7. Install the extracted python-client module in your Python environment using pip: pip install path/to/package/python-client.
  8. The installed package has the name swagger_client. You can check that the installation worked using the command python -c "import swagger_client".

Note

Due to a known bug in the Swagger autogeneration, you might encounter errors on importing the swagger_client package. These can be fixed by deleting the line with the content

from swagger_client.models.model import Model  # noqa: F401,E501

from the file swagger_client/models/model.py and the line with the content

from swagger_client.models.inner_error import InnerError  # noqa: F401,E501

from the file swagger_client/models/inner_error.py inside the installed package. The error message will tell you where these files are located for your installation.
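
If you'd rather patch the installed package with a script than edit the files by hand, a sketch along these lines can remove the two imports. This is only a sketch: it assumes the generated package is installed as swagger_client with the file layout described above.

import importlib.util
import os

# Locate the installed package without importing it (importing may fail because of the bug).
spec = importlib.util.find_spec("swagger_client")
package_dir = os.path.dirname(spec.origin)

# The two imports the note above says to delete.
bad_lines = {
    "from swagger_client.models.model import Model  # noqa: F401,E501",
    "from swagger_client.models.inner_error import InnerError  # noqa: F401,E501",
}

for relative_path in (os.path.join("models", "model.py"),
                      os.path.join("models", "inner_error.py")):
    path = os.path.join(package_dir, relative_path)
    with open(path) as source:
        lines = source.readlines()
    with open(path, "w") as target:
        target.writelines(line for line in lines if line.strip() not in bad_lines)

print("Patched files in " + package_dir)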

Install other dependencies

The sample uses the requests library. You can install it with the command

pip install requests

Start with some boilerplate code

Let's add some code that works as a skeleton for our project.

#!/usr/bin/env python
# coding: utf-8
from typing import List

import logging
import sys
import requests
import time
import swagger_client as cris_client


logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format="%(message)s")

# Your subscription key and region for the speech service
SUBSCRIPTION_KEY = "YourSubscriptionKey"
SERVICE_REGION = "YourServiceRegion"

NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"

LOCALE = "en-US"
RECORDINGS_BLOB_URI = "<Your SAS Uri to the recording>"

# Set subscription information when doing transcription with custom models
ADAPTED_ACOUSTIC_ID = None  # guid of a custom acoustic model
ADAPTED_LANGUAGE_ID = None  # guid of a custom language model


def transcribe():
    logging.info("Starting transcription client...")


if __name__ == "__main__":
    transcribe()

You'll need to replace the following values:

  • YourSubscriptionKey: found in the Keys page of the Azure portal for the Speech resource
  • YourServiceRegion: found in the Overview page of the Azure portal for the Speech resource
  • YourFileUrl: found under the Blob service / Containers page of the Azure portal for the Storage account resource
    • Select the appropriate container
    • Select the desired blob
    • Copy the URL under the Properties page

Create and configure an Http Client

The first thing we'll need is an HTTP client that has the correct base URL and authentication set. Insert this code in transcribe.

configuration = cris_client.Configuration()
configuration.api_key['Ocp-Apim-Subscription-Key'] = SUBSCRIPTION_KEY
configuration.host = "https://{}.cris.ai".format(SERVICE_REGION)

# create the client object and authenticate
client = cris_client.ApiClient(configuration)

# create an instance of the transcription api class
transcription_api = cris_client.CustomSpeechTranscriptionsApi(api_client=client)

Generate a transcription request

Next, we'll generate the transcription request. Add this code to transcribe.

transcription_definition = cris_client.TranscriptionDefinition(
    name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI
)

Send the request and check its status

Now we post the request to the Speech service and check the initial response code. This response code simply indicates whether the service has received the request. The service returns a URL in the response headers; that's the location where it will store the transcription status.

data, status, headers = transcription_api.create_transcription_with_http_info(transcription_definition)

# extract transcription location from the headers
transcription_location: str = headers["location"]

# get the transcription Id from the location URI
created_transcription: str = transcription_location.split('/')[-1]

logging.info("Created new transcription with id {}".format(created_transcription))

Wait for the transcription to complete

Since the service processes the transcription asynchronously, we need to poll for its status every so often. We'll check every 5 seconds.

We'll enumerate all the transcriptions that this Speech service resource is processing and look for the one we created.

Here's the polling code, which displays the status for everything except a successful completion; we'll handle that next.

logging.info("Checking status.")

completed = False

while not completed:
    running, not_started = 0, 0

    # get all transcriptions for the user
    transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

    # for each transcription in the list we check the status
    for transcription in transcriptions:
        if transcription.status in ("Failed", "Succeeded"):
            # we check to see if it was the transcription we created from this client
            if created_transcription != transcription.id:
                continue

            completed = True

            if transcription.status == "Succeeded":
                # Results display is added in the next section
                pass
            else:
                logging.info("Transcription failed: {}.".format(transcription.status_message))
                break
        elif transcription.status == "Running":
            running += 1
        elif transcription.status == "NotStarted":
            not_started += 1

    logging.info("Transcriptions status: "
            "completed (this transcription): {}, {} running, {} not started yet".format(
                completed, running, not_started))

    # wait for 5 seconds
    time.sleep(5)

Display the transcription results

Once the service has successfully completed the transcription, the results will be stored in another URL that we can get from the status response.

Here we get that result JSON and display it.

results_uri = transcription.results_urls["channel_0"]
results = requests.get(results_uri)
logging.info("Transcription succeeded. Results: ")
logging.info(results.content.decode("utf-8"))
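
If you want more than the raw JSON, you can optionally walk the payload and pull out just the recognized text. The following is a minimal sketch that continues from the results variable above; it assumes the results JSON has the same shape as the wrappers used by the other samples in this quickstart (AudioFileResults, SegmentResults, and NBest entries with a Display field).

import json

# Sketch only: the field names below are assumptions based on the JSON
# wrappers used in the other language samples in this quickstart.
root = json.loads(results.content.decode("utf-8"))

for audio_file in root["AudioFileResults"]:
    print("There were {} results in {}".format(
        len(audio_file["SegmentResults"]), audio_file["AudioFileName"]))
    for segment in audio_file["SegmentResults"]:
        if segment["RecognitionStatus"].lower() == "success" and segment["NBest"]:
            print("Best text result was: '{}'".format(segment["NBest"][0]["Display"]))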

Check your code

At this point, your code should look like this: (We've added some comments to this version)

#!/usr/bin/env python
# coding: utf-8

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE.md file in the project root for full license information.

from typing import List

import logging
import sys
import requests
import time
import swagger_client as cris_client


logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format="%(message)s")

# Your subscription key and region for the speech service
SUBSCRIPTION_KEY = "YourSubscriptionKey"
SERVICE_REGION = "YourServiceRegion"

NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"

LOCALE = "en-US"
RECORDINGS_BLOB_URI = "<Your SAS Uri to the recording>"

# Set subscription information when doing transcription with custom models
ADAPTED_ACOUSTIC_ID = None  # guid of a custom acoustic model
ADAPTED_LANGUAGE_ID = None  # guid of a custom language model


def transcribe():
    logging.info("Starting transcription client...")

    # configure API key authorization: subscription_key
    configuration = cris_client.Configuration()
    configuration.api_key['Ocp-Apim-Subscription-Key'] = SUBSCRIPTION_KEY
    configuration.host = "https://{}.cris.ai".format(SERVICE_REGION)

    # create the client object and authenticate
    client = cris_client.ApiClient(configuration)

    # create an instance of the transcription api class
    transcription_api = cris_client.CustomSpeechTranscriptionsApi(api_client=client)

    # Use base models for transcription. Comment this block if you are using a custom model.
    # Note: you can specify additional transcription properties by passing a
    # dictionary in the properties parameter. See
    # https://docs.microsoft.com/azure/cognitive-services/speech-service/batch-transcription
    # for supported parameters.
    transcription_definition = cris_client.TranscriptionDefinition(
        name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI
    )

    # Uncomment this block to use custom models for transcription.
    # Model information (ADAPTED_ACOUSTIC_ID and ADAPTED_LANGUAGE_ID) must be set above.
    # if ADAPTED_ACOUSTIC_ID is None or ADAPTED_LANGUAGE_ID is None:
    #     logging.info("Custom model ids must be set to when using custom models")
    # transcription_definition = cris_client.TranscriptionDefinition(
    #     name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI,
    #     models=[cris_client.ModelIdentity(ADAPTED_ACOUSTIC_ID), cris_client.ModelIdentity(ADAPTED_LANGUAGE_ID)]
    # )

    data, status, headers = transcription_api.create_transcription_with_http_info(transcription_definition)

    # extract transcription location from the headers
    transcription_location: str = headers["location"]

    # get the transcription Id from the location URI
    created_transcription: str = transcription_location.split('/')[-1]

    logging.info("Created new transcription with id {}".format(created_transcription))

    logging.info("Checking status.")

    completed = False

    while not completed:
        running, not_started = 0, 0

        # get all transcriptions for the user
        transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

        # for each transcription in the list we check the status
        for transcription in transcriptions:
            if transcription.status in ("Failed", "Succeeded"):
                # we check to see if it was the transcription we created from this client
                if created_transcription != transcription.id:
                    continue

                completed = True

                if transcription.status == "Succeeded":
                    results_uri = transcription.results_urls["channel_0"]
                    results = requests.get(results_uri)
                    logging.info("Transcription succeeded. Results: ")
                    logging.info(results.content.decode("utf-8"))
                else:
                    logging.info("Transcription failed :{}.".format(transcription.status_message))
                    break
            elif transcription.status == "Running":
                running += 1
            elif transcription.status == "NotStarted":
                not_started += 1

        logging.info("Transcriptions status: "
                "completed (this transcription): {}, {} running, {} not started yet".format(
                    completed, running, not_started))

        # wait for 5 seconds
        time.sleep(5)

    input("Press any key...")


if __name__ == "__main__":
    transcribe()

Build and run your app

Now you're ready to build your app and test your speech recognition using the Speech service.

Next steps

In this quickstart, you will use a REST API to recognize speech from files in a batch process. A batch process executes the speech transcription without any user interactions. It gives you a simple programming model, without the need to manage concurrency, custom speech models, or other details. It entails advanced control options, while making efficient use of Azure speech service resources.

For more information on the available options and configuration details, see batch transcription.

The following quickstart will walk you through a usage sample.

If you prefer to jump right in, view or download all Speech SDK JavaScript Samples on GitHub. Otherwise, let's get started.

Prerequisites

Before you get started, make sure to:

Create a new JS file

The first step is to make sure that you have your project open in your favorite editor.

Call your file index.js.

Start with some boilerplate code

Let's add some code that works as a skeleton for our project.

const https = require("https");

// Replace with your subscription key
const SubscriptionKey = "YourSubscriptionKey";

// Update with your service region
const Region = "YourServiceRegion";
const Port = 443;

// Recordings and locale
const Locale = "en-US";
const RecordingsBlobUri = "YourFileUrl";

// Name and description
const Name = "Simple transcription";
const Description = "Simple transcription description";

const SpeechToTextBasePath = "/api/speechtotext/v2.0/";

You'll need to replace the following values:

  • YourSubscriptionKey: found in the Keys page of the Azure portal for the Speech resource
  • YourServiceRegion: found in the Overview page of the Azure portal for the Speech resource
  • YourFileUrl: found under the Blob service / Containers page of the Azure portal for the Storage account resource
    • Select the appropriate container
    • Select the desired blob
    • Copy the URL under the Properties page

JSON Wrappers

The REST APIs take requests in JSON format and also return results in JSON. To make the requests and responses easier to understand, we'll declare a few classes to use for serializing / deserializing the JSON.

class ModelIdentity {
    id;
}

class Transcription {
    Name;
    Description;
    Locale;
    RecordingsUrl;
    ResultsUrls;
    Id;
    CreatedDateTime;
    LastActionDateTime;
    Status;
    StatusMessage;
}

class TranscriptionDefinition {
    Name;
    Description;
    RecordingsUrl;
    Locale;
    Models;
    Properties;
}

Create an initial transcription request

Next, we'll generate the transcription request.

const ts = {
    Name: Name,
    Description: Description,
    Locale: Locale,
    RecordingsUrl: RecordingsBlobUri,
    Properties: {
        "PunctuationMode": "DictatedAndAutomatic",
        "ProfanityFilterMode": "Masked",
        "AddWordLevelTimestamps": "True"
    },
    Models: []
}

const postPayload = JSON.stringify(ts);

const startOptions = {
    hostname: Region + ".cris.ai",
    port: Port,
    path: SpeechToTextBasePath + "Transcriptions/",
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        'Content-Length': postPayload.length,
        "Ocp-Apim-Subscription-Key": SubscriptionKey
    }
}

Send the transcription request

Now we post the request to the Speech service and check the initial response code. This response code simply indicates whether the service has received the request. The service returns a URL in the response headers; that's the location where it will store the transcription status.

Then we'll call a method CheckTranscriptionStatus to check on the status and eventually print the results. We'll implement CheckTranscriptionStatus next.

const request = https.request(startOptions, (response) => {
    if (response.statusCode != 202) {
        console.error("Error, status code " + response.statusCode);
    } else {

        const transcriptionLocation = response.headers.location;

        console.info("Created transcription at location " + transcriptionLocation);
        console.info("Checking status.");

        CheckTranscriptionStatus(transcriptionLocation);
    }
});

request.on("error", error => {
    console.error(error);
});

request.write(postPayload);
request.end();

Check the request's status

Since the service processes the transcription asynchronously, we need to poll for its status every so often. We'll check every 5 seconds.

We can check the status by retrieving the content at the URL we got when we posted the request. When we get the content back, we deserialize it into an object to make it easier to interact with.

Here's the polling code, which displays the status for everything except a successful completion; we'll handle that next.

CheckTranscriptionStatus takes the status URL from the transcription request and polls it every 5 seconds until it indicates success or an error. It then calls PrintResults to print the results of the transcription. We'll implement PrintResults next.

function CheckTranscriptionStatus(statusUrl) {
    const fetchOptions = {
        headers: {
            "Ocp-Apim-Subscription-Key": SubscriptionKey
        }
    }

    const fetchRequest = https.get(new URL(statusUrl), fetchOptions, (response) => {
        if (response.statusCode !== 200) {
            console.info("Error retrieving status: " + response.statusCode);
        } else {
            let responseText = '';
            response.setEncoding('utf8');
            response.on("data", (chunk) => {
                responseText += chunk;
            });

            response.on("end", () => {
                const statusObject = JSON.parse(responseText);

                var done = false;
                switch (statusObject.status) {
                    case "Failed":
                        console.info("Transcription failed. Status: " + transcription.StatusMessage);
                        done = true;
                        break;
                    case "Succeeded":
                        done = true;
                        PrintResults(statusObject.resultsUrls["channel_0"]);
                        break;
                    case "Running":
                        console.info("Transcription is still running.");
                        break;
                    case "NotStarted":
                        console.info("Transcription has not started.");
                        break;
                }

                if (!done) {
                    setTimeout(() => {
                        CheckTranscriptionStatus(statusUrl);
                    }, (5000));
                }
            });
        }
    });

    fetchRequest.on("error", error => {
        console.error(error);
    });
}

Display the transcription results

Once the service has successfully completed the transcription, the results will be stored in another URL that we can get from the status response. Here we make a request to that URL and read the response. Once the results are loaded, we can print them to the console.

function PrintResults(resultUrl)
{
    const fetchOptions = {
        headers: {
            "Ocp-Apim-Subscription-Key": SubscriptionKey
        }
    }

    const fetchRequest = https.get(new URL(resultUrl), fetchOptions, (response) => {
        if (response.statusCode !== 200) {
            console.info("Error retrieving status: " + response.statusCode);
        } else {
            let responseText = '';
            response.setEncoding('utf8');
            response.on("data", (chunk) => {
                responseText += chunk;
            });

            response.on("end", () => {
                console.info("Transcription Results:");
                console.info(responseText);
            });
        }
    });
}

Check your code

At this point, your code should look like this:

const https = require("https");

// Replace with your subscription key
const SubscriptionKey = "YourSubscriptionKey";

// Update with your service region
const Region = "YourServiceRegion";
const Port = 443;

// Recordings and locale
const Locale = "en-US";
const RecordingsBlobUri = "YourFileUrl";

// Name and description
const Name = "Simple transcription";
const Description = "Simple transcription description";

const SpeechToTextBasePath = "/api/speechtotext/v2.0/";

class ModelIdentity {
    id;
}

class Transcription {
    Name;
    Description;
    Locale;
    RecordingsUrl;
    ResultsUrls;
    Id;
    CreatedDateTime;
    LastActionDateTime;
    Status;
    StatusMessage;
}

class TranscriptionDefinition {
    Name;
    Description;
    RecordingsUrl;
    Locale;
    Models;
    Properties;
}

const ts = {
    Name: Name,
    Description: Description,
    Locale: Locale,
    RecordingsUrl: RecordingsBlobUri,
    Properties: {
        "PunctuationMode": "DictatedAndAutomatic",
        "ProfanityFilterMode": "Masked",
        "AddWordLevelTimestamps": "True"
    },
    Models: []
}

const postPayload = JSON.stringify(ts);

const startOptions = {
    hostname: Region + ".cris.ai",
    port: Port,
    path: SpeechToTextBasePath + "Transcriptions/",
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        'Content-Length': postPayload.length,
        "Ocp-Apim-Subscription-Key": SubscriptionKey
    }
}

function PrintResults(resultUrl)
{
    const fetchOptions = {
        headers: {
            "Ocp-Apim-Subscription-Key": SubscriptionKey
        }
    }

    const fetchRequest = https.get(new URL(resultUrl), fetchOptions, (response) => {
        if (response.statusCode !== 200) {
            console.info("Error retrieving status: " + response.statusCode);
        } else {
            let responseText = '';
            response.setEncoding('utf8');
            response.on("data", (chunk) => {
                responseText += chunk;
            });

            response.on("end", () => {
                console.info("Transcription Results:");
                console.info(responseText);
            });
        }
    });
}

function CheckTranscriptionStatus(statusUrl) {
    const fetchOptions = {
        headers: {
            "Ocp-Apim-Subscription-Key": SubscriptionKey
        }
    }

    const fetchRequest = https.get(new URL(statusUrl), fetchOptions, (response) => {
        if (response.statusCode !== 200) {
            console.info("Error retrieving status: " + response.statusCode);
        } else {
            let responseText = '';
            response.setEncoding('utf8');
            response.on("data", (chunk) => {
                responseText += chunk;
            });

            response.on("end", () => {
                const statusObject = JSON.parse(responseText);

                var done = false;
                switch (statusObject.status) {
                    case "Failed":
                        console.info("Transcription failed. Status: " + transcription.StatusMessage);
                        done = true;
                        break;
                    case "Succeeded":
                        done = true;
                        PrintResults(statusObject.resultsUrls["channel_0"]);
                        break;
                    case "Running":
                        console.info("Transcription is still running.");
                        break;
                    case "NotStarted":
                        console.info("Transcription has not started.");
                        break;
                }

                if (!done) {
                    setTimeout(() => {
                        CheckTranscriptionStatus(statusUrl);
                    }, (5000));
                }
            });
        }
    });

    fetchRequest.on("error", error => {
        console.error(error);
    });
}

const request = https.request(startOptions, (response) => {
    if (response.statusCode != 202) {
        console.error("Error, status code " + response.statusCode);
    } else {

        const transcriptionLocation = response.headers.location;

        console.info("Created transcription at location " + transcriptionLocation);
        console.info("Checking status.");

        CheckTranscriptionStatus(transcriptionLocation);
    }
});

request.on("error", error => {
    console.error(error);
});

request.write(postPayload);
request.end();

Run your app

Now you're ready to build your app and test your speech recognition using the Speech service.

Start your app - Run node index.js.

Next steps

View or download all Speech SDK Samples on GitHub.

Additional language and platform support

If you've clicked this tab, you probably didn't see a quickstart in your favorite programming language. Don't worry, we have additional quickstart materials and code samples available on GitHub. Use the table to find the right sample for your programming language and platform/OS combination.

| Language | Additional Quickstarts | Code samples |
|----------|------------------------|--------------|
| C# | From mic, From file | .NET Framework, .NET Core, UWP, Unity, Xamarin |
| C++ | From mic, From file | Windows, Linux, macOS |
| Java | From mic, From file | Android, JRE |
| JavaScript | Browser from mic, Node.js from file | Windows, Linux, macOS |
| Objective-C | iOS from mic, macOS from mic | iOS, macOS |
| Python | From mic, From file | Windows, Linux, macOS |
| Swift | iOS from mic, macOS from mic | iOS, macOS |