question

SureshBettadapur-4155 avatar image
0 Votes"
SureshBettadapur-4155 asked romungi-MSFT answered

Autoscaling issue with AKS attached as ML inference cluster

We have a AKS cluster defined inside a VNet. This AKS cluster is used as an Inference cluster for Azure ML and models are deployed on the AKS from ML. Due to this reason, AKS cluster of "loadBalancer" outbound type is created which creates a load balancer of Public IP [a requirement of ML]

As this is inside a VNet, PublicIP is not routable and to access the scoring endpoints deployed on AKS, we have created a NGINX Ingress controller, with an Internal IP

Now, everything is working fine, but PODs (and Nodes) aren't Autoscaling. Have enabled cluster autoscaler and per MS advice, not enabled HPA.

What could be the reason? Can you please advice?

azure-machine-learningazure-kubernetes-service
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

romungi-MSFT avatar image
0 Votes"
romungi-MSFT answered

@SureshBettadapur-4155 I think the number of nodes is not increased as part of autoscaling for an AKS deployment of Azure ml model. Please refer the note here in the documentation.
The azureml-fe component scales the number of replicas for the model within the physical cluster boundaries. I think in your case the Azure ML router azureml-fe might not be started.
Have you tried to check if autoscaling works when you deploy to a new cluster which is created from the azure ml portal with advanced network settings that enables you to select the virtual network instead of using an existing AKS cluster?


5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

SureshBettadapur-4155 avatar image
0 Votes"
SureshBettadapur-4155 answered

Thanks for the response

I can see azureml-fe service in default namespace when I run the below command

kubectl get svc -A

default azureml-fe LoadBalancer <Cluster-IP> <External-IP>

POD autoscaling is also not happening

ML service and AKS cluster are created using Terraform scripts. Existing AKS cluster is attached as Inference cluster

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.