JensChristianJohannsen-8989 asked romungi-MSFT commented

Azure ML web endpoint unreachable after successful deployment

The deployment state of the service is marked as being unhealthy.

Compute target is AKS.

[Screenshot: 27654-capture.png]

The pod is running, and the logs say that init() completed successfully.

Also, the service works when deployed as a local web service.

The model is small, execution time is < 2 min, and we are requesting 0.7 CPU cores and 0.5 GB of memory. Increasing these requests does not solve it, so I guess the problem is not related to the resource request limits.

However, when trying to consume the scoring service, a 504 error is returned:

ERROR - Received bad response from Model Management Service:
Response Code: 504
Headers: {'Date': 'Wed, 23 Sep 2020 18:44:31 GMT', 'Content-Type': 'text/html', 'Content-Length': '160', 'Connection': 'keep-alive', 'x-request-time': '180.032', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}
Content: b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'


I guess this needs to be fixed on the AKS side, but what should be done? Any help is much appreciated.

Container logs:

[Screenshot: 27761-capture2.png]


azure-machine-learning, azure-machine-learning-inference

@JensChristianJohannsen-8989 The error log indicates a scoring timeout, so one option would be to increase the scoring_timeout_ms value in your deployment configuration, although the existing value is already 300 s. You could also review score.py to check whether the script can be improved to return the response sooner.

If it is not possible to optimize the scoring script, could you also try deploying an endpoint with more CPU cores and memory?
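
One detail in the 504 response may be worth checking before raising the timeout further: the `x-request-time` header reports roughly 180 s, which is shorter than the 300 s scoring timeout. The Python sketch below simply makes that comparison. It assumes `x-request-time` is expressed in seconds (not confirmed in this thread), and the helper name `timed_out_before_scoring_limit` is hypothetical.

```python
def timed_out_before_scoring_limit(headers: dict, scoring_timeout_ms: int) -> bool:
    """Return True if the request ended before the scoring timeout elapsed.

    Assumption: 'x-request-time' is reported in seconds.
    """
    request_seconds = float(headers["x-request-time"])
    return request_seconds * 1000 < scoring_timeout_ms


# Headers copied from the 504 response in the question.
headers = {
    "Date": "Wed, 23 Sep 2020 18:44:31 GMT",
    "Content-Type": "text/html",
    "x-request-time": "180.032",
}

if timed_out_before_scoring_limit(headers, scoring_timeout_ms=300_000):
    print("Request ended before the scoring timeout; another limit may apply.")
```

If the check is true, the request may have been cut off by something other than scoring_timeout_ms, although the headers alone do not prove which component did it.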


Thanks for the comment.

Increasing the scoring_timeout_ms value does not solve the issue, and neither does increasing the resource request limits for CPU and memory.
The score.py file is OK, as I am able to run it locally. The logs also show that init() completes successfully.
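
To put a number on the local run, a small timing wrapper can help. The following is only a sketch: `time_scoring_call` and the stand-in `run` function are hypothetical, and in practice you would import `init` and `run` from your own score.py and pass a real request payload.

```python
import time
from typing import Any, Callable, Tuple


def time_scoring_call(run: Callable[[Any], Any], payload: Any) -> Tuple[Any, float]:
    """Call run(payload) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run(payload)
    return result, time.perf_counter() - start


def run(raw_data: str) -> str:
    # Stand-in for the run() function in score.py (hypothetical).
    return raw_data.upper()


result, elapsed = time_scoring_call(run, "sample input")
print(f"run() took {elapsed:.3f} s")
```

Comparing the measured time with the roughly 180 s shown in the `x-request-time` header of the 504 response gives a rough idea of how much headroom a single request has.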

I am more inclined to believe that the error is in the AKS networking configuration.

Is there any way to change the Kubernetes service type from NodePort to ClusterIP when deploying?

romungi-MSFT replied to JensChristianJohannsen-8989:

@JensChristianJohannsen-8989 Is your workspace deployed behind a firewall? If so, you might need to allow certain Microsoft hosts, such as *.azureml.ms, in the firewall rules.
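
A rough way to test this from a machine behind the firewall is to attempt an outbound TCP connection to each required host. The sketch below uses only the Python standard library. The function name `can_connect` is hypothetical, port 443 is an assumption, and the host list is deliberately left empty because the concrete hostnames depend on your workspace and region.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Placeholder: fill in the concrete hosts your workspace needs.
hosts_to_check = []

for host in hosts_to_check:
    status = "reachable" if can_connect(host) else "blocked or unreachable"
    print(f"{host}: {status}")
```

A successful TCP connection only shows that the port is open on the network path; it does not confirm that the service behind it is healthy.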

You can also use an existing AKS cluster if required. The sample notebook for deploying an AKS web service for Azure ML details the steps.



0 Answers