Missing apostrophes when uploading human-labeled transcript for custom speech

Virtro Dev 21 Reputation points
2021-01-18T18:23:43.487+00:00

Hi, I am trying to create a custom speech-to-text (STT) model with the Custom Speech service. After uploading my Audio + human-labeled transcript data (a txt file, separated by \t, UTF-8 with BOM), many of the apostrophes are missing in the Human-labeled transcription (normalized). For example, "don't" becomes "don t" and "can't" becomes "can t". There are exceptions, though: "there's" and "it's" are labeled correctly. Because of the incorrect labels, I can't train and test my custom model properly. Please help!

Azure AI Speech
An Azure service that integrates speech processing into apps and services.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

Accepted answer
  1. romungi-MSFT 42,316 Reputation points Microsoft Employee
    2021-01-19T05:27:12.337+00:00

    @Virtro Dev I suspect that some of the apostrophes in your transcript are Latin-1 or Unicode characters that crept in, rather than the plain ASCII apostrophe. The guidance in the documentation is to replace such characters with the appropriate ASCII substitution. Since some of your words come through processing without issues, the easiest fix is to replace the problematic apostrophes with the one that goes through. Thanks!
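
A minimal sketch of the substitution described above, assuming the transcript is a UTF-8 file with BOM as stated in the question. The list of apostrophe look-alikes and the function names are illustrative assumptions, not taken from the Custom Speech documentation; extend the mapping if your file contains other variants.

```python
# Characters that commonly stand in for an apostrophe.
# This list is an assumption for illustration; add others as needed.
APOSTROPHE_LIKE = {
    "\u2019": "'",  # right single quotation mark (typical "smart" apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u02BC": "'",  # modifier letter apostrophe
    "\u00B4": "'",  # acute accent (Latin-1)
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe (U+0027)."""
    return text.translate(_TABLE)


def normalize_transcript(src_path: str, dst_path: str) -> None:
    """Rewrite a transcript file with ASCII apostrophes only.

    'utf-8-sig' strips the BOM on read and writes it back on output;
    newline="" leaves the original line endings and tabs untouched.
    """
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(normalize_apostrophes(content))
```

For example, `normalize_transcript("trans.txt", "trans_fixed.txt")` (hypothetical file names) would write a cleaned copy that can be uploaded in place of the original.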

