CreidLee-3246 asked CreidLee-3246 edited

Azure Document Translation cannot translate pdf files

I am developing a web app that translates PDF files using Azure Document Translation, and I am facing the following blocking issue.

Every translation request for a PDF file is accepted, but the translation status is always Failed, and the document status reports an invalid document due to corruption or an unsupported type/extension. Below is an example of a request and the corresponding failed response.

request:
{
  "inputs": [
    {
      "storageType": "File",
      "source": {
        "sourceUrl": "https://my.blob.core.windows.net/public-file/a95e8311b924453ea18fd735cdb9535c.pdf?sv=2020-04-08&st=2021-08-27T09%3A13%3A18Z&se=2021-08-28T09%3A13%3A18Z&sr=b&sp=r&sig=cTztb3rwYzFreC9fh81IfVFefwAJFGfi438fCVomcSM%3D"
      },
      "targets": [
        {
          "targetUrl": "https://my.blob.core.windows.net/public-file/a95e8311b924453ea18fd735cdb9535c_translated.pdf?sp=wl&st=2021-08-27T09:14:41Z&se=2021-08-28T09:14:41Z&sv=2020-08-04&sr=c&sig=D4RrSiL0bGdy%2BXNCUYEm94h0CHzNCv1%2FN%2B7nsDxRcY0%3D",
          "language": "zh-Hans"
        }
      ]
    }
  ]
}

response:
{
  "value": [
    {
      "sourcePath": "https://my.blob.core.windows.net/public-file/a95e8311b924453ea18fd735cdb9535c.pdf",
      "createdDateTimeUtc": "2021-08-27T09:56:58.5368104Z",
      "lastActionDateTimeUtc": "2021-08-27T09:57:07.3741074Z",
      "status": "Failed",
      "to": "zh-Hans",
      "error": {
        "code": "InvalidRequest",
        "message": "Document failed during checking validity. This may be caused by corruption or unsupported type/extension.",
        "target": "Document",
        "innerError": {
          "code": "InvalidDocument",
          "message": "Document failed during checking validity. This may be caused by corruption or unsupported type/extension."
        }
      },
      "progress": 0,
      "id": "273622bd-835c-4946-9798-fd8f19f6bbf2",
      "characterCharged": 0
    }
  ]
}
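
When several documents are submitted at once, it can help to pull just the failed entries and their inner error codes out of a status response like the one above. The following is a minimal sketch using only Python's standard library; the function name `failed_documents` and the abbreviated `sample` payload are illustrative assumptions, not part of any Azure SDK.

```python
def failed_documents(status_response: dict) -> list:
    """Collect failed entries with their outer and inner error codes."""
    failures = []
    for doc in status_response.get("value", []):
        if doc.get("status") != "Failed":
            continue
        error = doc.get("error", {})
        inner = error.get("innerError", {})
        failures.append({
            "id": doc.get("id"),
            "sourcePath": doc.get("sourcePath"),
            "code": error.get("code"),
            "innerCode": inner.get("code"),
            # Prefer the inner message when present, else the outer one.
            "message": inner.get("message") or error.get("message"),
        })
    return failures


# Abbreviated copy of the response shown above (fields trimmed for brevity).
sample = {
    "value": [
        {
            "sourcePath": "https://my.blob.core.windows.net/public-file/a95e8311b924453ea18fd735cdb9535c.pdf",
            "status": "Failed",
            "to": "zh-Hans",
            "error": {
                "code": "InvalidRequest",
                "target": "Document",
                "innerError": {"code": "InvalidDocument"},
            },
            "id": "273622bd-835c-4946-9798-fd8f19f6bbf2",
        }
    ]
}

for failure in failed_documents(sample):
    print(failure["id"], failure["code"], failure["innerCode"])
```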

I checked every PDF file uploaded to Azure Blob storage. Each one opens normally and none is corrupt. I also tried other formats such as text and Word files. Those are translated successfully, so PDF seems to be the only format that fails. Is the service temporarily unavailable for translating PDF files, or is the request missing parameters required for PDF translation?

azure-translator

1 Answer

romungi-MSFT answered CreidLee-3246 edited

@CreidLee-3246 From the service perspective, I tested with one of my PDF documents and the file is translated correctly.

I suspect the issue might be with the shared access signature (SAS): the response shows only the source file path, which suggests the signature was not picked up correctly. Looking at your source URL, the following part seems incorrect:

 ?sv=2020-04-08&st=2021-08-27T09%3A13%3A18Z&se=2021-08-28T09%3A13%3A18Z

Shouldn't it be the following:

 ?sv=2020-04-08&st=2021-08-27T09:13:18Z&se=2021-08-28T09:13:18Z

It seems ':' has been replaced with its percent-encoded (URL-encoded) form, %3A.

The target URL, though, seems to have it right.

Could you please re-check the signature and try the scenario again?
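
One way to re-check this is to decode both spellings of the query string and compare the results. Below is a minimal sketch using Python's standard `urllib.parse`; the helper name `sas_params` and the blob name `example.pdf` are hypothetical, and the URLs carry only the three parameters quoted above. Standard URL decoding treats `%3A` and `:` as the same character, so if the decoded values match, the encoding by itself is unlikely to explain the failure.

```python
from urllib.parse import urlsplit, parse_qs


def sas_params(url: str) -> dict:
    """Return the decoded query parameters of a SAS URL (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical blob name; only the timestamp parameters from the thread are included.
encoded = ("https://my.blob.core.windows.net/public-file/example.pdf"
           "?sv=2020-04-08&st=2021-08-27T09%3A13%3A18Z&se=2021-08-28T09%3A13%3A18Z")
literal = ("https://my.blob.core.windows.net/public-file/example.pdf"
           "?sv=2020-04-08&st=2021-08-27T09:13:18Z&se=2021-08-28T09:13:18Z")

print(sas_params(encoded))
print(sas_params(encoded) == sas_params(literal))
```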


Thanks for your troubleshooting. I finally found the root cause: file encryption. The PDF files I used to test Document Translation are encrypted with content extraction prohibited, although PDF reader applications can open them without a password. Document Translation most likely cannot extract text from encrypted files and therefore reports that the files cannot be translated. However, the response message is misleading, and it kept me from troubleshooting the problem directly. Since the supported formats include Office file formats, which may be encrypted as well, I suggest the API return a more specific error message when it is asked to translate an encrypted file.
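
For anyone hitting the same error, a rough pre-upload check for encrypted PDFs can be sketched as follows. This is only a heuristic, not a real PDF parser: it assumes that an encrypted PDF references an `/Encrypt` entry in its trailer (or cross-reference stream dictionary) and simply searches the raw bytes for that name, so false positives are possible. The function names and the two byte strings are illustrative stubs, not valid PDF files; a proper PDF library would give a more reliable answer.

```python
from pathlib import Path


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: report True if the bytes look like a PDF that names /Encrypt."""
    return pdf_bytes.startswith(b"%PDF-") and b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path: str) -> bool:
    """Apply the heuristic to a file on disk before uploading it."""
    return looks_encrypted(Path(path).read_bytes())


# Illustrative stubs only (not complete PDF files).
plain_stub = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Size 4 >>\n%%EOF"
locked_stub = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R /Size 6 >>\n%%EOF"

print(looks_encrypted(plain_stub))
print(looks_encrypted(locked_stub))
```

Files flagged by such a check could be decrypted or re-exported before being submitted for translation.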