Asked by RahulSivankutty-0003 (rolled back by srbose-msft)

No response from API for long running jobs in ACI or AKS

I deployed my container image on ACI. It contains an API built with FastAPI, which we use to trigger machine learning tasks in the backend. When the run time is short, I get the response from the API, but when the run time is greater than ~300 seconds, I don't get any response from the API once the task is completed. When I test this locally, everything works fine, so I suspect it has something to do with ACI. The same issue happens when I deploy the image in AKS.
Is there a job completion time limit in ACI and AKS when the job is triggered from an API?

Tags: azure-kubernetes-service, azure-container-instances

@RahulSivankutty-0003 , Thank you for your question.

Currently there is no Azure limit on the amount of time within which a job has to complete on ACI or AKS.

However, you can set .spec.backoffLimit for a Kubernetes Job to specify the number of retries before the Job is considered failed. Reference

.spec.activeDeadlineSeconds is another Kubernetes feature that can be used to limit the amount of time the job may take to complete. Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Reference
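To make the two Job fields above concrete, here is a minimal sketch that builds a batch/v1 Job manifest as a Python dict and prints it as JSON (which `kubectl apply -f` accepts). The job name, image, and the default values are placeholders chosen for illustration, not values from this thread.

```python
import json


def build_job_manifest(name: str, image: str,
                       backoff_limit: int = 2,
                       active_deadline_seconds: int = 3600) -> dict:
    """Build a batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed.
            "backoffLimit": backoff_limit,
            # Wall-clock limit for the whole Job; it takes precedence
            # over backoffLimit once the deadline is reached.
            "activeDeadlineSeconds": active_deadline_seconds,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = build_job_manifest("ml-task", "example.azurecr.io/ml-task:latest")
    print(json.dumps(manifest, indent=2))
```

Neither field is an Azure-imposed limit; both are limits you opt into yourself on the Job.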

I am assuming for long running jobs it is your API which isn't responding after job completion. Please correct me if you meant the Kubernetes API by any chance.

Can you please check the resource consumption of the job pods (on AKS, with kubectl top po), and whether they run to completion or show any errors in the logs? [How-to guide]



Hi @srbose-msft , Thank you for the reply.
For long-running jobs, I can see the job completing in the backend, and the result is printed as well. Resource consumption does not seem to be the cause: I created a dummy endpoint that runs for a given number of seconds and returns a dummy response. If this time is greater than 240 seconds, I don't get a response in ACI, but all jobs that complete in 240 seconds or less return the response. And if it were an issue with the API, it shouldn't work locally either, right? Yet whatever the run time, I get the response from my API when I run it in my local container.
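For anyone who wants to reproduce this, a dummy endpoint of the kind described above could look roughly like the sketch below. It is a standard-library stand-in, not the FastAPI code from the original deployment, and the `seconds` query parameter, host, and port are assumptions made for illustration.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


class SleepHandler(BaseHTTPRequestHandler):
    """Answers any GET after sleeping for ?seconds=N (default 0)."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        seconds = float(query.get("seconds", ["0"])[0])
        # Nothing is sent on the connection while we sleep, so an
        # intermediary with an idle timeout sees the connection as idle.
        time.sleep(seconds)
        body = json.dumps({"slept": seconds}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Create the server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), SleepHandler)
```

Running `make_server().serve_forever()` in the container and requesting, say, `?seconds=230` and then `?seconds=250` from outside should show whether the cutoff sits near the 240-second mark.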

Any idea what the run-time limit is in ACI and AKS? I guess it is 4 minutes for ACI. And what is the maximum I can increase it to?

srbose-msft answered

@RahulSivankutty-0003 ,

That said, this looks like a textbook use case for the Async Request/Reply pattern, i.e., decoupling the frontend API from the backend processing. Reference
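As a rough illustration of that pattern, the sketch below keeps jobs in an in-memory registry: submitting a task returns a job id immediately (an HTTP layer would answer 202 Accepted with it), and the client polls for status instead of holding one connection open for the whole run. The `JobStore` class and its method names are hypothetical, and a real deployment would want a durable store and a proper worker or queue rather than threads in the API process.

```python
import threading
import time
import uuid
from typing import Any, Callable, Dict


class JobStore:
    """In-memory job registry (illustrative only, not durable)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Dict[str, Any]] = {}
        self._lock = threading.Lock()

    def submit(self, task: Callable[[], Any]) -> str:
        """Start the task in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}
        threading.Thread(target=self._run, args=(job_id, task), daemon=True).start()
        return job_id

    def _run(self, job_id: str, task: Callable[[], Any]) -> None:
        try:
            result, status = task(), "succeeded"
        except Exception as exc:  # record the failure instead of losing it
            result, status = str(exc), "failed"
        with self._lock:
            self._jobs[job_id] = {"status": status, "result": result}

    def status(self, job_id: str) -> Dict[str, Any]:
        """Return a snapshot of the job; a status endpoint would serve this."""
        with self._lock:
            return dict(self._jobs[job_id])


if __name__ == "__main__":
    store = JobStore()

    def slow_task() -> str:
        time.sleep(0.5)  # stands in for a long ML run
        return "done"

    job_id = store.submit(slow_task)
    while store.status(job_id)["status"] == "running":
        time.sleep(0.1)  # each poll is a short, separate request
    print(store.status(job_id))
```

Because every request is short, no single connection has to stay idle longer than an intermediary's timeout, regardless of how long the backend task runs.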
If a period of inactivity is longer than the timeout value, there's no guarantee that the TCP or HTTP session is maintained. A common practice is to use a TCP keep-alive. This practice keeps the connection active for a longer period. For more information, see the .NET examples.
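For a Python client, enabling TCP keep-alive on a socket might look like the sketch below. The timing values are assumptions picked to stay well under a four-minute idle timeout, and the per-socket tuning options are platform-specific (the names used here are the Linux ones), so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket,
                     idle_seconds: int = 60,
                     interval_seconds: int = 30,
                     probes: int = 5) -> None:
    """Turn on TCP keep-alive and, where supported, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first probe is sent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    # Gap between successive probes.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    # Unanswered probes before the connection is considered dead.
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keep-alive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

How you attach these options to the socket used by a particular HTTP client library depends on that library, so treat this as the socket-level part only.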

Please do let me know if this helps.




@srbose-msft ,
I think this is what I should do: "decouple the frontend API from the backend processing", because our backend run time will sometimes exceed 30 minutes. Thank you for the response.

srbose-msft answered

@RahulSivankutty-0003 , Thank you for your response.

Now that you mention 240s (4 minutes), we suspect this is down to the TCP reset/idle timeout for the Azure Load Balancer. Reference

This timeout is currently not configurable for ACI container groups that are not deployed in a virtual network, but in AKS it can be changed using an annotation on the LoadBalancer Service. Reference
If you are using Azure Firewall to route traffic to the Azure Container Instance, Azure Firewall TCP Idle Timeout is four minutes. This setting isn't user configurable, but you can contact Azure Support to increase the idle timeout up to 30 minutes. Reference
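For the AKS case, the annotation in question is `service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout`, with the value given in minutes. Below is a minimal sketch that builds such a Service manifest as a Python dict and prints it as JSON for `kubectl apply -f`. The service name, selector label, ports, and the 30-minute value are placeholders; please check the current AKS documentation for the allowed range.

```python
import json

IDLE_TIMEOUT_ANNOTATION = (
    "service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout"
)


def build_service_manifest(name: str, app_label: str,
                           idle_timeout_minutes: int = 30,
                           port: int = 80, target_port: int = 8000) -> dict:
    """Build a LoadBalancer Service manifest with a TCP idle timeout."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # Annotation values must be strings; the unit is minutes.
            "annotations": {IDLE_TIMEOUT_ANNOTATION: str(idle_timeout_minutes)},
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    # Hypothetical service name and label, for illustration only.
    print(json.dumps(build_service_manifest("ml-api", "ml-api"), indent=2))
```

Even with a longer idle timeout, a run time beyond the maximum still needs a different approach, such as polling for the result.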

