question

1 Vote
PaulNerie-9756 asked PaulNerie-9756 commented

Speech CLI speech to text output format, and using MP3 as input format

I'm trying out this Cognitive Service speech-to-text CLI:

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/speech-to-text-from-file?tabs=linux%2Cbrowser%2Cwindowsinstall&pivots=programmer-tool-spx

I can generate an output file, but the transcript is just one very long line.

Is there a way to have some line breaks somehow? Maybe when the service detects a pause, it inserts a line break?

Also, when I try to use an MP3 file (using --format mp3), I get this error:

ERROR: Exception with an error code: 0x29 (SPXERR_GSTREAMER_NOT_FOUND_ERROR)

Do I need to install anything else for MP3 to work?

Thank you very much.

azure-cognitive-services azure-speech

0 Votes
GiftA-MSFT answered PaulNerie-9756 commented

Hi, thanks for the updates. In response to your first question, we currently don't have paragraph-level support. However, you can split the output text yourself: use the period at the end of each sentence together with the timestamps to measure the pause between two sentences, and start a new paragraph when the pause is long enough. I have passed this feedback to the product team internally, but feel free to share this request on the UserVoice forum so that you and others can up-vote it and help the product team prioritize this feature. Thanks.
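The sentence-plus-pause approach can be sketched in Python roughly as follows. This is only an illustration, not something the Speech CLI provides: the `Segment` class, its field names, and the 1.5-second threshold are all assumptions, and it presumes you have already extracted each recognized phrase with its start time and duration (converted to seconds) from whatever output file you produced.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Segment:
    """One recognized phrase. Hypothetical structure, not an spx output format."""
    text: str
    start: float     # seconds from the start of the audio (assumed unit)
    duration: float  # seconds (assumed unit)


def group_into_paragraphs(segments: Iterable[Segment], pause: float = 1.5) -> List[str]:
    """Start a new paragraph when the previous phrase ended a sentence and the
    silence before the next phrase is at least `pause` seconds."""
    paragraphs: List[str] = []
    current: List[str] = []
    prev_end: Optional[float] = None

    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # ignore empty recognition results

        if current and prev_end is not None:
            gap = seg.start - prev_end
            ended_sentence = current[-1].endswith((".", "?", "!"))
            if ended_sentence and gap >= pause:
                paragraphs.append(" ".join(current))
                current = []

        current.append(text)
        prev_end = seg.start + seg.duration

    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    demo = [
        Segment("Hello there.", 0.0, 1.0),
        Segment("How are you?", 1.2, 1.0),
        Segment("New topic.", 5.0, 1.0),
    ]
    # Paragraphs are separated by a blank line in the final text.
    print("\n\n".join(group_into_paragraphs(demo)))
```

The pause threshold is a judgment call; a larger value gives fewer, longer paragraphs, so it is worth tuning against a sample of your own audio.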




Thanks for the tip! It would be nice not to have an additional step to parse/format the output text.

0 Votes
PaulNerie-9756 answered PaulNerie-9756 commented

For anyone who stumbles upon this and has the same question about the error I posted above, there is actually documentation for it:

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-codec-compressed-audio-input-streams?tabs=debian&pivots=programming-language-csharp


So after further tinkering, I found out that you have to use this version of GStreamer; otherwise the necessary library file (libgstreamer-1.0-0.dll) will not be available:
https://gstreamer.freedesktop.org/data/pkg/windows/1.15.1/

I used this file:
gstreamer-1.0-x86_64-1.15.1.msi
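
If the error persists after installing, one thing worth checking is whether the DLL can actually be found. The Python sketch below simply lists the PATH directories that contain a given file. It assumes the Speech CLI locates libgstreamer-1.0-0.dll through PATH, which I have not confirmed; `find_on_path` is a helper name made up for this example.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_on_path(filename: str, path_value: Optional[str] = None) -> List[Path]:
    """Return the full path of `filename` in every PATH directory that contains it."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    hits: List[Path] = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip().strip('"')  # PATH entries are sometimes quoted on Windows
        if not entry:
            continue
        candidate = Path(entry) / filename
        if candidate.is_file():
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Assumption: the library is resolved via PATH.
    matches = find_on_path("libgstreamer-1.0-0.dll")
    if matches:
        for match in matches:
            print(f"found: {match}")
    else:
        print("libgstreamer-1.0-0.dll was not found in any PATH directory")
```

If nothing is found, adding the GStreamer bin directory from your install location to PATH and opening a new terminal is the first thing I would try.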

0 Votes
robch answered PaulNerie-9756 commented

Using the Speech CLI (spx), you can output different types of files that have the text "chunked" in different ways:

  • By default, spx outputs "all" recognized events, concatenated, in a file like "output.*.tsv".

  • With the --output each command line options, it also produces a file with "each" recognized event, so the text is in chunks.

  • With the --output batch json command line option, it also (or instead) produces a file with "all" recognized text concatenated.

See spx help recognize output for more details on each of the three options.

So, for your specific question, you can do this:

spx recognize --file example.wav --output each recognized text --output each file output.tsv --output each tsv file has header false


Thank you. I'll check it out.
