45286012 asked GiftA-MSFT commented

Speech to text on two streams

I'm trying to run real-time speech-to-text (STT) on two audio streams (one from the mic, one from the speaker). These are the options I'm considering:

  1. combine both streams into one and use the native diarization capability

  2. use the multichannel capability

  3. create two separate sessions

Option 1: I'm considering using PullAudioInputStream and combining both streams. But I'm using the JavaScript SDK and I can't figure out how to set the diarization option. Additionally, it seems diarization is not that great just yet.

Option 2: This seems to be limited to the Conversation Transcription API, but that requires 7 mics, etc. Not viable for my use case.

Option 3: Create two separate sessions, one per stream. This would double the cost, and I'd lose synchronization between the two streams.

Any thoughts?



azure-cognitive-services azure-speech

1 Answer

GiftA-MSFT answered GiftA-MSFT commented

Hi, thanks for reaching out. Please see my responses below:

  1. Can you elaborate on what you're trying to do here?

  2. Yes, for now there isn't a mono-conversation transcriber available. However, we may be able to put you in touch with our product team to find out whether there's a private workaround.

  3. If the streams start at the same time, the offsets could be used to reconcile the streams.
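As a rough illustration of point 3, here is a minimal sketch of reconciling two transcripts by offset. It assumes each recognized result carries an `offset` in 100-nanosecond ticks measured from the start of its own stream, which is how the Speech SDK reports result offsets. The function name `mergeTranscripts`, the `speaker` labels, and the `startSkewMs` field are hypothetical names introduced for this example; `startSkewMs` is only needed if the two streams did not start at the same moment.

```javascript
// Speech SDK result offsets are expressed in 100-nanosecond ticks.
const TICKS_PER_MS = 10000;

/**
 * Merge utterances from several streams into one time-ordered list.
 * Each stream: { speaker, startSkewMs?, utterances: [{ text, offset }] }
 * where `offset` is in ticks from the start of that stream and
 * `startSkewMs` (hypothetical) is how long after a shared reference
 * point the stream started. It defaults to 0, i.e. streams started together.
 */
function mergeTranscripts(streams) {
  const merged = [];
  for (const { speaker, startSkewMs = 0, utterances } of streams) {
    for (const u of utterances) {
      merged.push({
        speaker,
        text: u.text,
        startMs: u.offset / TICKS_PER_MS + startSkewMs,
      });
    }
  }
  // Order by start time on the shared timeline.
  merged.sort((a, b) => a.startMs - b.startMs);
  return merged;
}

// Example with made-up utterances from the two streams.
const timeline = mergeTranscripts([
  {
    speaker: "mic",
    utterances: [
      { text: "Hello?", offset: 5000000 },
      { text: "Fine, thanks.", offset: 42000000 },
    ],
  },
  {
    speaker: "speaker",
    utterances: [{ text: "Hi, how are you?", offset: 18000000 }],
  },
]);

for (const u of timeline) {
  console.log(`[${u.startMs} ms] ${u.speaker}: ${u.text}`);
}
```

If the two sessions cannot be started together, recording a wall-clock timestamp when each one starts and passing the difference as `startSkewMs` is one way to keep the merged order approximately right.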


Thank you for the response!

My overall goal is to transcribe both sides of a conversation (using mic & speaker).

Option 1 was to combine both of these streams into a single stream and then run diarization on the combined stream. Now that I think about it, it's not a great idea: there might be cross-talk, which would skew the diarization. Plus, from other posts, I understand diarization isn't that great at this time.

Option 2 - Would a "mono-conversation transcriber" support my use case? If yes, I would love to find out about the workaround.

Option 3 - Would there be a performance implication? I'm using the JavaScript SDK (essentially a Node app wrapped in Electron).


Hi, at this time, the best advice is to use two streams or wait for the Conversation Transcriber service to support something other than the 8-channel array. Sorry for any inconvenience. Hope this helps.
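For readers following the two-stream advice, one possible shape for it in the JavaScript SDK is sketched below, with one continuous recognizer per stream, each fed through its own push stream and tagged with a label. This is an assumption-laden sketch rather than an official pattern: the helper name `createLabeledRecognizer`, the `label` values, and the `onUtterance` callback are hypothetical, and the SDK module is passed in as a parameter so the helper itself has no hard dependency. How the mic and speaker audio is captured is left out entirely.

```javascript
/**
 * Create one continuous recognizer fed by its own push stream.
 * `sdk` is the imported microsoft-cognitiveservices-speech-sdk module.
 * `label` and `onUtterance` are hypothetical names for this sketch.
 */
function createLabeledRecognizer(sdk, speechConfig, label, onUtterance) {
  // Audio is written into this stream by your own capture code.
  const pushStream = sdk.AudioInputStream.createPushStream();
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  // Forward only final, successfully recognized results, tagged by stream.
  recognizer.recognized = (_sender, event) => {
    if (event.result.reason === sdk.ResultReason.RecognizedSpeech) {
      onUtterance({
        speaker: label,
        text: event.result.text,
        offset: event.result.offset, // 100-ns ticks from stream start
      });
    }
  };

  recognizer.startContinuousRecognitionAsync();
  return { pushStream, recognizer };
}

// Usage sketch (key and region are placeholders, pcmChunk comes from
// your own audio capture and is an ArrayBuffer of PCM samples):
//
// const sdk = require("microsoft-cognitiveservices-speech-sdk");
// const speechConfig = sdk.SpeechConfig.fromSubscription("<key>", "<region>");
// const mic = createLabeledRecognizer(sdk, speechConfig, "mic", console.log);
// const spk = createLabeledRecognizer(sdk, speechConfig, "speaker", console.log);
// mic.pushStream.write(pcmChunk);
```

Starting both recognizers back to back keeps their offsets roughly comparable, which matters if the two transcripts are later merged by offset.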

1 Vote 1 ·