Response times are very slow for the service

Karina Gerasimovich 10 Reputation points
2023-12-16T09:04:11.4366667+00:00

Can you please advise on the root cause of the following?

The Azure OpenAI service response times are very slow,

taking up to ~28 seconds to respond to a single request.

The same performance issue exists across all of our corresponding Azure OpenAI subscriptions and accounts:
UK South takes ~15 seconds, all other locations ~25 seconds.

  1. Streaming is not relevant here, because the difference between regions for the same process is 10+ seconds
  2. We can provide log rows where the LLM takes 25 and 15 seconds
    Log quote:
    "To run LLM - 26.32 sec - Canada East
     To run LLM - 14.09 sec - UK South"
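The per-region timings quoted above can be reproduced with a simple wall-clock wrapper around the request. A minimal sketch, using a stub function in place of the real (hypothetical here) Azure OpenAI call so it is self-contained:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds).
    fn stands in for the real Azure OpenAI completion call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stub in place of the real API call, purely for illustration:
def fake_llm(prompt):
    time.sleep(0.05)  # stand-in for network + model latency
    return f"echo: {prompt}"

reply, elapsed = timed_call(fake_llm, "hello")
print(f"To run LLM - {elapsed:.2f} sec")
```

Running the same wrapper against endpoints in different regions would yield log lines comparable to the ones quoted.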
1 answer

  1. navba-MSFT 17,365 Reputation points Microsoft Employee
    2023-12-18T05:03:46.57+00:00

    @Karina Gerasimovich Apologies for the late reply. Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

    Questions:
    Could you please let me know which GPT model version you are using?

    Also, could you please provide the prompt size and the max_tokens setting?

    Please check the latency metrics and let me know which API operation is consuming the time. Open the Azure OpenAI resource in the portal, navigate to the Metrics section, apply splitting on the latency metric, and check which API / operationName is time-consuming.

    [Screenshot: latency metric with splitting applied]

    Also, please check the Time to Response metric with splitting applied:

    [Screenshot: Time to Response metric with splitting applied]
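The same "splitting" the portal applies to the latency metric can be mirrored locally over exported log rows, to see which region (or operationName) is slow. A minimal sketch with hypothetical rows based on the log quote in the question:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, shaped after the log quote above; real data would
# come from Azure Monitor metrics or diagnostic logs.
rows = [
    {"operation": "ChatCompletions_Create", "region": "Canada East", "sec": 26.32},
    {"operation": "ChatCompletions_Create", "region": "UK South", "sec": 14.09},
]

# Group (split) the latency samples by region and average them.
by_region = defaultdict(list)
for r in rows:
    by_region[r["region"]].append(r["sec"])

avg = {region: mean(secs) for region, secs in by_region.items()}
print(avg)  # {'Canada East': 26.32, 'UK South': 14.09}
```

Splitting by `operation` instead of `region` would show whether one API operation dominates the total time.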

    Note:
    If you are using a GPT-4 model, higher latency is expected, since gpt-4 is a larger model than the gpt-3.5 version.
    As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.

    Action Plan:
    This article discusses how to improve latency performance with the Azure OpenAI service.

    Here are some of the best practices to lower latency:

    • Model latency: If model latency is important to you, we recommend trying out our latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that even when the total number of tokens generated is similar, the request with the higher max_tokens value will have more latency.
    • Lower total tokens generated: The fewer tokens generated, the faster the overall response. Think of generation as a for loop: n tokens = n iterations. Lower the number of tokens generated and the overall response time will improve accordingly.
    • Streaming: Enabling streaming can help manage user expectations by letting the user see the model response as it is being generated, rather than waiting until the last token is ready.
    • Content filtering: Content filtering improves safety, but it also impacts latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.
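    The tips above mostly translate into request parameters. A minimal sketch of how a latency-conscious chat-completions request might be configured; the model name and values are illustrative, not recommendations, and no network call is made here:

```python
# Illustrative request settings; keys match the Azure OpenAI
# chat-completions parameters (model, messages, max_tokens, stream).
request = {
    "model": "gpt-35-turbo",  # faster model series for latency-sensitive work
    "messages": [{"role": "user", "content": "Summarize in one sentence: ..."}],
    "max_tokens": 100,        # keep as low as the task allows
    "stream": True,           # surface partial output while generating
}

print(sorted(request))
```

    With `stream: True` the client receives tokens as they are produced, which does not shorten total generation time but greatly improves perceived latency.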

    Please let me know if you have any follow-up questions. I would be happy to answer them.

    Awaiting your reply.
