question

DanMarculescu-1199 asked DanMarculescu-1199 commented

Deployment timeout error in AKS and endpoint stuck in "Transitioning" state

We are deploying 170 ML models using Azure Machine Learning studio and Azure Kubernetes Service (AKS), following this doc: "https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/how-to-deploy-azure-kubernetes-service.md".

We train each model with a Python script in a custom environment and register it in the Azure ML service. Once the model is registered, we deploy it to AKS using container images.

We are able to deploy only 10 to 11 models per node in AKS. When we try to deploy more models on the same node, we get a deployment timeout error with the error message below.

129464-deployment-error.png


We deploy the model to Azure Kubernetes Service with Python, using the sample code below.


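    from azureml.core import Environment
    from azureml.core.compute import ComputeTarget
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AksWebservice

    # Create an environment and add the conda dependencies to it; the custom
    # container image is built from this environment.
    # pip_packages is passed by keyword because the first positional
    # parameter of CondaDependencies.create() is pip_indexurl.
    myenv = Environment(name=Deployment_name)
    myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=pip_packages)

    # Inference configuration
    inf_config = InferenceConfig(environment=myenv, entry_script='./Script_file.py')

    # Deployment configuration
    deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, cpu_cores_limit=2, memory_gb_limit=2, traffic_percentile=10)

    # AKS cluster compute target
    aks_target = ComputeTarget(ws, 'pipeline')

    # Deploy the model to AKS (Model.deploy takes the models argument as a list of Model objects)
    service = Model.deploy(ws, Deployment_name, model_1, inf_config,
                           deployment_config, aks_target, overwrite=True)

    service.wait_for_deployment(show_output=True)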
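
When `wait_for_deployment` times out, the service state and container logs are usually the first things to look at. The helper below is a minimal sketch of that, not part of the original script: `wait_and_report` is a hypothetical name, and it assumes `service` is the `Webservice` object returned by `Model.deploy`, using only its `wait_for_deployment()`, `state`, and `get_logs()` members.

```python
def wait_and_report(service, log_lines=100):
    """Wait for a deployment; on failure, collect the state and recent logs.

    `service` is assumed to be an azureml.core.webservice.Webservice
    (for example, the object returned by Model.deploy).
    """
    try:
        service.wait_for_deployment(show_output=True)
    except Exception as err:
        # The SDK typically raises WebserviceException on failure or timeout;
        # catching Exception keeps this sketch free of SDK imports.
        return {
            "ok": False,
            "state": service.state,  # e.g. "Transitioning" or "Failed"
            "error": str(err),
            "logs": service.get_logs(num_lines=log_lines),
        }
    return {"ok": True, "state": service.state, "error": None, "logs": None}
```

Calling `report = wait_and_report(service)` in place of the bare `service.wait_for_deployment(show_output=True)` and printing `report["logs"]` when `report["ok"]` is false may show whether the pod is waiting to be scheduled or the scoring container is failing to start.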

We also checked the Azure documentation but could not find any configuration or deployment setting for AKS nodes.


Can you please clarify the statement "The number of models to be deployed is limited to 1,000 models per deployment (per container)"? Can you also give some insight into how to increase the number of ML models that can be deployed on each node in Azure Kubernetes Service? Thanks!

azure-machine-learning azure-kubernetes-service azure-container-registry
deployment-error.png (230.8 KiB)

1 Answer

shivapatpi-MSFT answered DanMarculescu-1199 commented

Hello @DanMarculescu-1199,
Could you take a look at this similar post, which was answered with the relevant documentation?
https://docs.microsoft.com/en-us/answers/questions/540001/how-many-models-can-be-deployed-in-single-node-in.html

Let us know if that helps!

Regards,
Shiva.


Hello @shivapatpi-MSFT, thanks for the reply.
We have tried the steps mentioned in the similar post above. We can deploy 2 ML models in the same container, and we are using an AKS cluster for the deployment.
We are trying to create 171 deployments on the AKS cluster, and each deployment contains 2 ML models. We are able to create 10 to 11 deployments on a single node of the cluster. When we create more than 10 to 11 deployments on a node, we get the deployment timeout error. Currently, we have 16 nodes for 160 deployments on the AKS cluster. We are trying to reduce the node count by increasing the number of deployments on each node.

We have also checked other documentation and posts in the community and tried their solutions as well, but we still get the same error.

We would also like to know how many deployments can be created on a single node in the cluster.
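
As a rough way to reason about per-node capacity, the sketch below estimates how many deployments fit on one node from the resource requests alone. It assumes that `cpu_cores` and `memory_gb` in the deployment configuration act as the Kubernetes resource requests the scheduler uses for placement. The node size (12 vCPU, 48 GB) and the reserved amounts are hypothetical values chosen for illustration, not figures from this thread. AKS also enforces a maximum pods-per-node setting, which is modeled here as the optional `max_pods` argument.

```python
import math

def deployments_per_node(node_cpu, node_mem_gb, cpu_request, mem_request_gb,
                         reserved_cpu=0.0, reserved_mem_gb=0.0, max_pods=None):
    """Estimate how many single-replica deployments fit on one node."""
    by_cpu = math.floor((node_cpu - reserved_cpu) / cpu_request)
    by_mem = math.floor((node_mem_gb - reserved_mem_gb) / mem_request_gb)
    fit = min(by_cpu, by_mem)
    if max_pods is not None:
        fit = min(fit, max_pods)
    return max(0, fit)

def nodes_needed(total_deployments, per_node):
    """Number of nodes required to host all deployments."""
    return math.ceil(total_deployments / per_node)

# Hypothetical node: 12 vCPU / 48 GB, with 1 vCPU and 4 GB reserved for system pods.
current = deployments_per_node(12, 48, cpu_request=1, mem_request_gb=1,
                               reserved_cpu=1, reserved_mem_gb=4)
print(current, nodes_needed(171, current))  # 11 16

# Same node with the request halved to 0.5 vCPU and 0.5 GB per deployment.
smaller = deployments_per_node(12, 48, cpu_request=0.5, mem_request_gb=0.5,
                               reserved_cpu=1, reserved_mem_gb=4)
print(smaller, nodes_needed(171, smaller))  # 22 8
```

Under these assumptions, a request of 1 CPU core per deployment makes CPU the limiting resource, so lowering `cpu_cores` (if the models can tolerate it) or using larger nodes would be the levers for fitting more deployments per node.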
