AnkitRawat-4616 asked YutongTie-MSFT commented

ML model deployment issue

I am trying to deploy an ML classification model on Azure using the GUI.

After registering/uploading the model in the portal, I deploy it to an Azure Container Instance with a custom entry script and conda dependencies.

Entry Script

 # Importing packages
 import pandas as pd
 import pickle
 import regex, json
 import numpy as np
 import sklearn
 import os
    
 from inference_schema.schema_decorators import input_schema, output_schema
 from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
    
 def init():
     global model
     global classes
     model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'randomForest50.pkl')
     model = pickle.load(open(model_path, "rb"))
     classes = lambda x : ["F", "M"][x]
    
 input_sample = np.array([['Thomas', 'Anna']])
 output_sample = np.array(['m', 'F'])
    
    
 @input_schema('data', NumpyParameterType(input_sample))
 @output_schema(NumpyParameterType(output_sample))
 def run(data):
     try:
         namesList = json.loads(data)["data"]["names"]
         pred = list(map(classes, model.predict(preprocessing(namesList))))
         return str(pred[0])
     except Exception as e:
         error = str(e)
         return error


Conda.yaml

 name: prediction
 dependencies:
 - python=3.7
 - numpy
 - scikit-learn
 - pip:
     - azureml-defaults
     - pandas
     - pickle4
     - regex
     - inference-schema[numpy-support]   

After deployment, the endpoint's deployment state changes to Unhealthy, and the logs show that the program is stuck in a loop: the worker boots, invokes init(), and then boots again. See the logs below:

 2021-04-26T08:14:55,433967500+00:00 - rsyslog/run 
 2021-04-26T08:14:55,421414500+00:00 - iot-server/run 
 2021-04-26T08:14:55,540534600+00:00 - gunicorn/run 
 2021-04-26T08:14:55,646209100+00:00 - nginx/run 
 EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
 2021-04-26T08:14:58,234212800+00:00 - iot-server/finish 1 0
 2021-04-26T08:14:58,324505300+00:00 - Exit code 1 is normal. Not restarting iot-server.
 Starting gunicorn 19.9.0
 Listening at: http://127.0.0.1:31311 (62)
 Using worker: sync
 worker timeout is set to 300
 Booting worker with pid: 89
 SPARK_HOME not set. Skipping PySpark Initialization.
 Initializing logger
 2021-04-26 08:15:11,623 | root | INFO | Starting up app insights client
 2021-04-26 08:15:11,624 | root | INFO | Starting up request id generator
 2021-04-26 08:15:11,631 | root | INFO | Starting up app insight hooks
 2021-04-26 08:15:11,632 | root | INFO | Invoking user's init function
 worker timeout is set to 300
 Booting worker with pid: 91
 SPARK_HOME not set. Skipping PySpark Initialization.
 Initializing logger
 2021-04-26 08:15:29,014 | root | INFO | Starting up app insights client
 2021-04-26 08:15:29,014 | root | INFO | Starting up request id generator
 2021-04-26 08:15:29,014 | root | INFO | Starting up app insight hooks
 2021-04-26 08:15:29,014 | root | INFO | Invoking user's init function
 worker timeout is set to 300
 Booting worker with pid: 98
 SPARK_HOME not set. Skipping PySpark Initialization.
 ...
 ...
 ...


I also tried to deploy the model using the Python SDK, but that failed as well, with this message:

 WebserviceException: WebserviceException:
  Message: Service deployment polling reached non-successful terminal state, current service state: Failed
 Operation ID: 98e464d4-5b15-4606-936f-a2625f7bd1fd
 More information can be found using '.get_logs()'
 Error:
 {
   "code": "AciDeploymentFailed",
   "statusCode": 400,
   "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.",
   "details": [
     {
       "code": "CrashLoopBackOff",
       "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information."
     },
     {
       "code": "AciDeploymentFailed",
       "message": "Your container application crashed. Please follow the steps to debug:\n\t1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\n\t2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.\n\t3. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t4. View the diagnostic events to check status of container, it may help you to debug the issue.\n\"RestartCount\": 3\n\"CurrentState\": {\"state\":\"Waiting\",\"startTime\":null,\"exitCode\":null,\"finishTime\":null,\"detailStatus\":\"CrashLoopBackOff: Back-off restarting failed\"}\n\"PreviousState\": {\"state\":\"Terminated\",\"startTime\":\"2021-04-27T10:46:03.903Z\",\"exitCode\":111,\"finishTime\":\"2021-04-27T10:46:07.524Z\",\"detailStatus\":\"Error\"}\n\"Events\":\n{\"count\":1,\"firstTimestamp\":\"2021-04-27T10:42:37Z\",\"lastTimestamp\":\"2021-04-27T10:42:37Z\",\"name\":\"Pulling\",\"message\":\"pulling image \\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\"\",\"type\":\"Normal\"}\n{\"count\":1,\"firstTimestamp\":\"2021-04-27T10:44:15Z\",\"lastTimestamp\":\"2021-04-27T10:44:15Z\",\"name\":\"Pulled\",\"message\":\"Successfully pulled image 
\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\"\",\"type\":\"Normal\"}\n{\"count\":4,\"firstTimestamp\":\"2021-04-27T10:44:40Z\",\"lastTimestamp\":\"2021-04-27T10:46:03Z\",\"name\":\"Started\",\"message\":\"Started container\",\"type\":\"Normal\"}\n{\"count\":4,\"firstTimestamp\":\"2021-04-27T10:44:43Z\",\"lastTimestamp\":\"2021-04-27T10:46:07Z\",\"name\":\"Killing\",\"message\":\"Killing container with id 5c5ddb266c4b38b1c306367712d9bec0687e5f6979e34afea7f6b943edf7db75.\",\"type\":\"Normal\"}\n"
     }
   ]
 }
  InnerException None
  ErrorResponse 
 {
     "error": {
         "message": "Service deployment polling reached non-successful terminal state, current service state: Failed\nOperation ID: 98e464d4-5b15-4606-936f-a2625f7bd1fd\nMore information can be found using '.get_logs()'\nError:\n{\n  \"code\": \"AciDeploymentFailed\",\n  \"statusCode\": 400,\n  \"message\": \"Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\\n\\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\\n\\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\",\n  \"details\": [\n    {\n      \"code\": \"CrashLoopBackOff\",\n      \"message\": \"Your container application crashed. This may be caused by errors in your scoring file's init() function.\\n\\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\\n\\t2. You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\"\n    },\n    {\n      \"code\": \"AciDeploymentFailed\",\n      \"message\": \"Your container application crashed. 
Please follow the steps to debug:\\n\\t1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\\n\\t2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.\\n\\t3. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t4. View the diagnostic events to check status of container, it may help you to debug the issue.\\n\\\"RestartCount\\\": 3\\n\\\"CurrentState\\\": {\\\"state\\\":\\\"Waiting\\\",\\\"startTime\\\":null,\\\"exitCode\\\":null,\\\"finishTime\\\":null,\\\"detailStatus\\\":\\\"CrashLoopBackOff: Back-off restarting failed\\\"}\\n\\\"PreviousState\\\": {\\\"state\\\":\\\"Terminated\\\",\\\"startTime\\\":\\\"2021-04-27T10:46:03.903Z\\\",\\\"exitCode\\\":111,\\\"finishTime\\\":\\\"2021-04-27T10:46:07.524Z\\\",\\\"detailStatus\\\":\\\"Error\\\"}\\n\\\"Events\\\":\\n{\\\"count\\\":1,\\\"firstTimestamp\\\":\\\"2021-04-27T10:42:37Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:42:37Z\\\",\\\"name\\\":\\\"Pulling\\\",\\\"message\\\":\\\"pulling image \\\\\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\\\\\"\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":1,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:15Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:44:15Z\\\",\\\"name\\\":\\\"Pulled\\\",\\\"message\\\":\\\"Successfully pulled image 
\\\\\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\\\\\"\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":4,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:40Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:46:03Z\\\",\\\"name\\\":\\\"Started\\\",\\\"message\\\":\\\"Started container\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":4,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:43Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:46:07Z\\\",\\\"name\\\":\\\"Killing\\\",\\\"message\\\":\\\"Killing container with id 5c5ddb266c4b38b1c306367712d9bec0687e5f6979e34afea7f6b943edf7db75.\\\",\\\"type\\\":\\\"Normal\\\"}\\n\"\n    }\n  ]\n}"
     }
 }

I have deployed the same model with the same entryScript.py and the same conda.yaml previously, and it worked fine.

I cannot figure out what the issue is here. Can anybody suggest how to solve this?








Hello,

Please let us know if you have more questions. If the answer helped, accepting it will help others find it. Thanks.


Regards,
Yutong


1 Answer

YutongTie-MSFT answered

Hello,

Thanks for reaching out to us. Based on the log, it seems your container application crashed and this may be caused by errors in your scoring file's init() function.
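
The error message in the question also embeds a short diagnostic block (RestartCount, CurrentState, PreviousState). As a rough sketch, assuming you have the decoded message text as a plain string, the following pulls those fields out so the restart count and previous exit code are easy to read. The function name parse_diagnostics is only illustrative, and the sample text is copied from the error above.

```python
import json
import re


def parse_diagnostics(message: str) -> dict:
    """Collect lines of the form "Key": <json value> from an error message."""
    fields = {}
    for line in message.splitlines():
        match = re.match(r'^"(\w+)":\s*(.+)$', line.strip())
        if not match:
            continue  # prose lines, event lines, and empty keys are skipped
        key, raw = match.groups()
        try:
            fields[key] = json.loads(raw)
        except json.JSONDecodeError:
            fields[key] = raw  # keep text that is not valid JSON unchanged
    return fields


# Sample lines taken from the error message in the question.
sample = (
    "View the diagnostic events to check status of container.\n"
    '"RestartCount": 3\n'
    '"CurrentState": {"state":"Waiting","startTime":null,"exitCode":null,'
    '"finishTime":null,"detailStatus":"CrashLoopBackOff: Back-off restarting failed"}\n'
    '"PreviousState": {"state":"Terminated","startTime":"2021-04-27T10:46:03.903Z",'
    '"exitCode":111,"finishTime":"2021-04-27T10:46:07.524Z","detailStatus":"Error"}\n'
    '"Events":\n'
)

info = parse_diagnostics(sample)
print("restarts:", info["RestartCount"])
print("previous exit code:", info["PreviousState"]["exitCode"])
```

In the posted error, the container was restarted 3 times and the previous run terminated with exit code 111 a few seconds after starting, which is consistent with a crash during startup.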

You can run service.get_logs() to get log information from the unhealthy service to see what's causing it to fail. Please refer to https://aka.ms/debugimage#debug-locally for more information.
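
Once you have the log text, a small helper can make the restart pattern easier to see. This is only a sketch: summarize_startup is a hypothetical name, the two marker strings are taken from the log excerpt in the question, and the crash-loop check is a simple heuristic rather than anything the service reports itself.

```python
def summarize_startup(log_text: str) -> dict:
    """Count worker boots and init() invocations in a container log."""
    lines = log_text.splitlines()
    boots = [l for l in lines if "Booting worker with pid" in l]
    inits = [l for l in lines if "Invoking user's init function" in l]
    # Lines that may carry the real Python error, if the log contains one.
    errors = [l for l in lines if "Traceback" in l or "Error" in l]
    return {
        "worker_boots": len(boots),
        "init_calls": len(inits),
        "error_lines": errors,
        # Heuristic: more than one boot, each earlier boot followed by init().
        "looks_like_init_crash_loop": len(boots) > 1
        and len(inits) >= len(boots) - 1,
    }


# Excerpt of the log posted in the question.
sample_log = """Booting worker with pid: 89
2021-04-26 08:15:11,632 | root | INFO | Invoking user's init function
Booting worker with pid: 91
2021-04-26 08:15:29,014 | root | INFO | Invoking user's init function
Booting worker with pid: 98
"""

summary = summarize_startup(sample_log)
print(summary)

# With a real service object from the AML SDK you would pass its logs instead:
# summary = summarize_startup(service.get_logs())
```

If error_lines comes back empty, as it does for the excerpt above, the worker is probably dying before Python can report an exception, and loading the model locally is the next thing to try.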

You can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information. You can also view the diagnostic events to check the status of the container, which may help you debug the issue.
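
When debugging locally, it can help to make the model-loading step in init() as verbose as possible, so that the real error appears in the output instead of a silent restart. The following is a minimal sketch, not a drop-in fix: load_model_verbose is a hypothetical helper, and the demo at the bottom uses a temporary directory and a dummy pickle in place of the registered model.

```python
import os
import pickle
import tempfile
import traceback


def load_model_verbose(model_dir, filename):
    """Load a pickled model, printing the directory contents and any traceback."""
    print("Model dir:", model_dir)
    if model_dir is None or not os.path.isdir(model_dir):
        raise FileNotFoundError(f"Model directory not found: {model_dir!r}")
    # Listing the folder shows whether the file sits where the script expects it.
    print("Contents:", sorted(os.listdir(model_dir)))
    model_path = os.path.join(model_dir, filename)
    try:
        with open(model_path, "rb") as handle:
            return pickle.load(handle)
    except Exception:
        traceback.print_exc()  # make the underlying error visible in the logs
        raise


# Local demo with a dummy pickle standing in for the real model file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.pkl"), "wb") as handle:
        pickle.dump({"ok": True}, handle)
    print(load_model_verbose(tmp, "demo.pkl"))

# In the entry script from the question, init() could call it like this:
# model = load_model_verbose(os.getenv("AZUREML_MODEL_DIR"), "randomForest50.pkl")
```

Running the same call on your own machine, inside an environment built from the same conda.yaml, should show whether the pickle loads at all with the package versions that the environment resolves today.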

You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.

If the issue persists, please share more information, such as the full output of service.get_logs(), so that we can help find the cause.
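
One piece of information that is often useful is the exact package versions in the environment where the model was trained and pickled. The conda.yaml in the question does not pin any versions, so a rebuild of the image may pull newer releases than the earlier, working deployment did. Whether that is the cause here is only an assumption; a pickled scikit-learn model is not guaranteed to load under a different scikit-learn version. The sketch below, assuming Python 3.8 or later in the training environment, prints pinned lines that you could copy into conda.yaml. The helper name pinned_lines is illustrative.

```python
from importlib import metadata  # available from Python 3.8


def pinned_lines(packages):
    """Return 'name==version' for each installed package, or a note if missing."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed in this environment")
    return lines


# Package names taken from the conda.yaml in the question.
for line in pinned_lines(
    ["numpy", "scikit-learn", "pandas", "regex", "inference-schema"]
):
    print(line)
```

Run this in the environment that produced randomForest50.pkl and compare the result with what the deployed image installs.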

Regards,
Yutong
