Hi, I am trying to create a custom STT model using the Custom Speech service. After uploading my audio + human-labeled transcript data (a tab-separated .txt file, UTF-8 with BOM), many of the apostrophes are missing in the normalized human-labeled transcription. For example, don't becomes don t and can't becomes can t. There are exceptions, though: there's and it's are labeled correctly. Because of the incorrect labels, I can't train and test my custom model correctly. Please help!

Missing apostrophes when uploading human-labeled transcript for custom speech

Accepted answer

romungi-MSFT 42,316 Reputation points Microsoft Employee

2021-01-19T05:27:12.337+00:00

@Virtro Dev I suspect some of the apostrophes in your transcript might be Latin-1 or Unicode characters that crept in. The guidance on apostrophes is to replace them with the appropriate ASCII substitution; the documentation covers this. Since some of your words come through processing without issues, the easiest fix is to replace the incorrect apostrophes with the ones that go through. Thanks!!
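
The substitution suggested above could be sketched roughly as follows. This is only an illustration, not an official tool: the helper name normalize_apostrophes and the list of look-alike characters are assumptions, and the list is not exhaustive.

```python
# Hypothetical helper (not part of any Azure SDK): map common typographic
# apostrophe look-alikes to the plain ASCII apostrophe U+0027.
# The set of characters below is an assumption and may be incomplete.
APOSTROPHE_LIKE = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u00b4": "'",  # ACUTE ACCENT (Latin-1)
    "\u2032": "'",  # PRIME
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def normalize_apostrophes(text: str) -> str:
    """Return text with apostrophe look-alikes replaced by ASCII U+0027."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_apostrophes("don\u2019t and can\u2018t"))
```

Running every transcript line through such a function before upload would rule out look-alike characters as the cause, even if it does not turn out to be the root problem.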
Virtro Dev 21 Reputation points

2021-01-19T07:26:43.337+00:00

Thank you very much, I will give it a try tomorrow. Thanks for the help.

Virtro Dev 21 Reputation points

2021-01-19T07:45:42.497+00:00

Hi, I just ran some Python code to check whether my strings are ASCII, and they all returned True. I believe I am using the appropriate ASCII apostrophes.

I tried:

    def is_ascii(s):
        # True only if every character is in the 7-bit ASCII range
        return all(ord(c) < 128 for c in s)

and the built-in str.isascii() method.
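
A True/False check does not show which character is the problem when one is found. A slightly more informative variant might look like the sketch below. The function names find_non_ascii and read_transcript are made up for illustration, and reading with the "utf-8-sig" codec is an assumption based on the file being UTF-8 with BOM, so that the BOM itself is not reported as a non-ASCII character.

```python
import unicodedata


def find_non_ascii(lines):
    """Return (line_no, column, code_point, unicode_name) for each non-ASCII character.

    Line and column numbers are 1-based.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


def read_transcript(path):
    """Read a transcript file into a list of lines.

    "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    """
    with open(path, encoding="utf-8-sig") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    # In-memory sample: the first line contains a typographic apostrophe.
    sample = ["audio1.wav\tdon\u2019t stop", "audio2.wav\tit's fine"]
    for hit in find_non_ascii(sample):
        print(hit)
```

An empty result from find_non_ascii(read_transcript(path)) would confirm that the file content itself is clean ASCII apart from the BOM.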

Virtro Dev 21 Reputation points

2021-01-19T19:42:16.047+00:00

I also tried substituting the apostrophes that did not go through with the one that did, but no luck. The result is the same: the apostrophes that were missing are still missing, and the ones that showed up correctly still show up correctly.

romungi-MSFT 42,316 Reputation points Microsoft Employee

2021-01-20T08:24:32.48+00:00

@Virtro Dev I tried some sentences with apostrophes and they seem to be processed correctly. I typed them into a Notepad document and uploaded it to the portal. Your scenario might be different, though, with many files or sentences.

Is it possible to share your document so that I can test it?

Virtro Dev 21 Reputation points

2021-01-20T17:43:33.43+00:00

@romungi-MSFT thanks for the help. I realized the issue is with the language I chose when I created my project. The apostrophe issue happens because I am using English (Australia):

The following image is from uploading the same file under a project that uses English (United States). The apostrophe error does not occur there:

What difference between English (Australia) and English (United States) would cause this issue?

romungi-MSFT 42,316 Reputation points Microsoft Employee

2021-01-21T15:09:46.537+00:00

@VitroDev-7604 This looks like a bug in the portal, and our team is working on a fix for all en-* locales. We will update the thread as soon as we have an update on the fix.

Virtro Dev 21 Reputation points

2021-01-22T17:25:24.557+00:00

Thank you very much

AnalyticsVirtro 1 Reputation point

2021-01-22T21:30:35.95+00:00

Thank you indeed!

It looks as if we are also getting unexpected results in our en_AU versus en_US STT model testing/training.

With labeled training data focusing on the AU accent, we are getting a lower error rate with the en_US model than with the en_AU model, in both the base and the additionally trained versions. Is this expected? We would appreciate any comments.

Virtro Dev 21 Reputation points

2021-02-03T01:58:59.713+00:00

@romungi-MSFT
Hi, I just want to follow up on this issue. Any luck fixing it? There is also another issue with the portal: when uploading more than 400 audio files + transcriptions, some of the transcriptions go missing at random. Could you also take a look at that? Thanks, have a great day.

romungi-MSFT 42,316 Reputation points Microsoft Employee

2021-02-04T07:31:11.847+00:00

@VirtoDev-7604 Thanks for checking. Our team has not yet deployed the fix to all production regions. The current ETA for the fix to reach most of our regions is mid-February.

For the issue with missing transcriptions you can download a report to check why some of the transcriptions did not get processed. The option should be available against your project and data.

Virtro Dev 21 Reputation points

2021-02-05T02:02:09.707+00:00

Thank you very much