Troubleshoot Azure Cache for Redis timeouts

This section discusses troubleshooting timeout issues that occur when connecting to Azure Cache for Redis.

Several of the troubleshooting steps in this guide include instructions to run Redis commands and monitor various performance metrics. For more information and instructions, see the articles in the Additional information section.

Redis server patching

Azure Cache for Redis regularly updates its server software as part of the managed service functionality that it provides. This patching activity takes place largely behind the scenes. During the failovers that occur when Redis server nodes are being patched, Redis clients connected to these nodes may experience temporary timeouts as connections are switched between the nodes. For more information on the side effects patching can have on your application and how you can improve its handling of patching events, see How does a failover affect my client application.

StackExchange.Redis timeout exceptions

StackExchange.Redis uses a configuration setting named synctimeout for synchronous operations, with a default value of 5000 ms. If a synchronous call doesn't complete in this time, the StackExchange.Redis client throws a timeout error similar to the following example:

    System.TimeoutException: Timeout performing MGET 2728cc84-58ae-406b-8ec8-3f962419f641, inst: 1,mgr: Inactive, queue: 73, qu=6, qs=67, qc=0, wr=1/1, in=0/0 IOCP: (Busy=6, Free=999, Min=2,Max=1000), WORKER (Busy=7,Free=8184,Min=2,Max=8191)

This error message contains metrics that can help point you to the cause and possible resolution of the issue. The following table contains details about the error message metrics.

| Error message metric | Details |
| --- | --- |
| inst | In the last time slice: 1 command has been issued |
| mgr | The socket manager is doing socket.select, which means it's asking the OS to indicate a socket that has something to do. The reader isn't actively reading from the network because it doesn't think there's anything to do |
| queue | There are 73 total in-progress operations |
| qu | 6 of the in-progress operations are in the unsent queue and haven't yet been written to the outbound network |
| qs | 67 of the in-progress operations have been sent to the server but a response isn't yet available. The response might not yet have been sent by the server, or it might have been sent by the server but not yet processed by the client |
| qc | 0 of the in-progress operations have seen replies but haven't yet been marked as complete because they're waiting on the completion loop |
| wr | There's an active writer (meaning the 6 unsent requests aren't being ignored). The value is reported as bytes/active writers |
| in | There are no active readers and zero bytes are available to be read on the NIC. The value is reported as bytes/active readers |
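
When many timeout exceptions show up in logs, it can help to pull these metrics out of the message programmatically. The following is a minimal sketch that uses only the .NET base class library; `TimeoutMessageParser` is a hypothetical helper, not part of StackExchange.Redis, and it assumes the message uses the `name: value` / `name=value` layout shown in the example above (other client versions may format the message differently).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of StackExchange.Redis): extracts the
// short metrics listed in the table from a timeout exception message.
public static class TimeoutMessageParser
{
    // A metric name, then ':' or '=', then a value that runs up to the next
    // comma or whitespace, for example "queue: 73", "qs=67", or "in=0/0".
    private static readonly Regex Metric = new Regex(
        @"\b(inst|mgr|queue|qu|qs|qc|wr|in)\s*[:=]\s*([^,\s]+)",
        RegexOptions.Compiled);

    public static Dictionary<string, string> Parse(string message)
    {
        var metrics = new Dictionary<string, string>();
        foreach (Match match in Metric.Matches(message))
        {
            // Group 1 is the metric name, group 2 is its raw text value.
            metrics[match.Groups[1].Value] = match.Groups[2].Value;
        }
        return metrics;
    }
}
```

For example, `TimeoutMessageParser.Parse(ex.Message)["qs"]` would give the text of the sent-but-unanswered count, which can then be converted with `int.Parse` and compared against `qu` to see whether requests are stuck before or after being written to the network.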

You can use the following steps to investigate possible root causes.

  1. As a best practice, make sure you're using the following pattern to connect when using the StackExchange.Redis client.

    // Create the multiplexer once, lazily, and share it across the application.
    private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(() =>
    {
        return ConnectionMultiplexer.Connect("cachename.redis.cache.windows.net,abortConnect=false,ssl=true,password=...");
    });

    public static ConnectionMultiplexer Connection => lazyConnection.Value;

    For more information, see Connect to the cache using StackExchange.Redis.

  2. Ensure that your cache and the client application are in the same region in Azure. For example, you might get timeouts when your cache is in East US but the client is in West US and the request doesn't complete within the synctimeout interval, or when you're debugging from your local development machine.

    It's highly recommended to have the cache and the client in the same Azure region. If you have a scenario that includes cross-region calls, you should set the synctimeout interval to a value higher than the default 5000-ms interval by including a synctimeout property in the connection string. For instance, adding synctimeout=2000 to the StackExchange.Redis connection string provided by Azure Cache for Redis sets the interval to 2000 ms; for cross-region calls, choose a value above the default instead.

  3. Ensure you're using the latest version of the StackExchange.Redis NuGet package. Bugs are constantly being fixed in the code to make it more robust to timeouts, so having the latest version is important.

  4. If your requests are bound by bandwidth limitations on the server or client, they take longer to complete and can cause timeouts. To see if your timeout is because of network bandwidth on the server, see Server-side bandwidth limitation. To see if your timeout is because of client network bandwidth, see Client-side bandwidth limitation.

  5. Are you getting CPU bound on the server or on the client?

    • Check if you're getting bound by CPU on your client. High CPU could cause the request to not be processed within the synctimeout interval and cause a request to time out. Moving to a larger client size or distributing the load can help to control this problem.
    • Check if you're getting CPU bound on the server by monitoring the CPU cache performance metric. Requests coming in while Redis is CPU bound can cause those requests to time out. To address this condition, you can distribute the load across multiple shards in a premium cache, or upgrade to a larger size or pricing tier. For more information, see Server-side bandwidth limitation.

  6. Are there commands taking a long time to process on the server? Commands that take a long time to process on the Redis server can cause timeouts. For more information about long-running commands, see Long-running commands. You can connect to your Azure Cache for Redis instance using the redis-cli client or the Redis Console. Then, run the SLOWLOG command to see if there are requests slower than expected. Redis server and StackExchange.Redis are optimized for many small requests rather than fewer large requests. Splitting your data into smaller chunks may improve things here.

    For information on connecting to your cache's TLS/SSL endpoint using redis-cli and stunnel, see the blog post Announcing ASP.NET Session State Provider for Redis Preview Release.

  7. High Redis server load can cause timeouts. You can monitor the server load by monitoring the Redis Server Load cache performance metric. A server load of 100 (the maximum value) signifies that the Redis server has been busy processing requests, with no idle time. To see if certain requests are taking up all of the server capability, run the SLOWLOG command, as described in the previous step. For more information, see High CPU usage / Server Load.

  8. Was there any other event on the client side that could have caused a network blip? Common events include scaling the number of client instances up or down, deploying a new version of the client, or having autoscale enabled. In our testing, we have found that autoscale or scaling up/down can cause outbound network connectivity to be lost for several seconds. StackExchange.Redis code is resilient to such events and reconnects. While reconnecting, any requests in the queue can time out.

  9. Was there a large request preceding several small requests to the cache that timed out? The parameter qs in the error message tells you how many requests were sent from the client to the server but haven't yet had a response processed. This value can keep growing because StackExchange.Redis uses a single TCP connection and can only read one response at a time. Even though the first operation timed out, it doesn't stop more data from being sent to or from the server. Other requests are blocked until the large request is finished, which can cause timeouts. One solution is to minimize the chance of timeouts by ensuring that your cache is large enough for your workload and splitting large values into smaller chunks. Another possible solution is to use a pool of ConnectionMultiplexer objects in your client, and choose the least loaded ConnectionMultiplexer when sending a new request. Spreading the load across multiple connection objects should prevent a single timeout from causing other requests to also time out.

  10. If you're using RedisSessionStateProvider, ensure you have set the retry timeout correctly. retryTimeoutInMilliseconds should be higher than operationTimeoutInMilliseconds; otherwise, no retries occur. In the following example, retryTimeoutInMilliseconds is set to 3000. For more information, see ASP.NET Session State Provider for Azure Cache for Redis and How to use the configuration parameters of Session State Provider and Output Cache Provider.

      connectionTimeoutInMilliseconds = "5000"
      operationTimeoutInMilliseconds = "1000"
      retryTimeoutInMilliseconds="3000" />

  11. Check memory usage on the Azure Cache for Redis server by monitoring Used Memory RSS and Used Memory. If an eviction policy is in place, Redis starts evicting keys when Used Memory reaches the cache size. Ideally, Used Memory RSS should be only slightly higher than Used Memory. A large difference means there's memory fragmentation (internal or external). When Used Memory RSS is less than Used Memory, part of the cache memory has been swapped by the operating system. If this swapping occurs, you can expect some significant latencies. Because Redis doesn't have control over how its allocations are mapped to memory pages, high Used Memory RSS is often the result of a spike in memory usage. When the Redis server frees memory, the allocator takes the memory, but it may or may not give the memory back to the system. As a result, there may be a discrepancy between the Used Memory value and the memory consumption reported by the operating system: memory may have been used and released by Redis but not given back to the system. To help mitigate memory issues, you can do the following steps:

    • Upgrade the cache to a larger size so that you aren't running against memory limitations on the system.
    • Set expiration times on the keys so that older values are evicted proactively.
    • Monitor the used_memory_rss cache metric. When this value approaches the size of your cache, you're likely to start seeing performance issues. Distribute the data across multiple shards if you're using a premium cache, or upgrade to a larger cache size.

    For more information, see Memory pressure on Redis server.
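
Step 9 suggests keeping a pool of ConnectionMultiplexer objects and choosing the least loaded one for each new request. The sketch below shows only the selection logic, written generically so it runs without a cache. `LeastLoadedPool<T>` and its `load` delegate are hypothetical names introduced for illustration. With StackExchange.Redis, `T` would be `ConnectionMultiplexer`, and the load function is an assumption you would need to supply, for instance an outstanding-operation count taken from the client's counters, if your library version exposes one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool (illustrative names): holds several connection-like
// objects and hands out the one that currently reports the smallest load.
public sealed class LeastLoadedPool<T>
{
    private readonly IReadOnlyList<T> _items;
    private readonly Func<T, long> _load;

    public LeastLoadedPool(IReadOnlyList<T> items, Func<T, long> load)
    {
        if (items == null || items.Count == 0)
            throw new ArgumentException("The pool needs at least one item.", nameof(items));
        _items = items;
        _load = load ?? throw new ArgumentNullException(nameof(load));
    }

    // Returns the item with the lowest reported load; ties go to the earliest item.
    public T Get()
    {
        T best = _items[0];
        long bestLoad = _load(best);
        for (int i = 1; i < _items.Count; i++)
        {
            long candidate = _load(_items[i]);
            if (candidate < bestLoad)
            {
                best = _items[i];
                bestLoad = candidate;
            }
        }
        return best;
    }
}
```

The load is sampled at the moment of each `Get()` call, so the choice is a best-effort snapshot rather than a guarantee. The benefit is that a large response occupying one connection doesn't hold up requests routed to the others.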

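Steps 6 and 9 both recommend splitting large values into smaller chunks. The following is a minimal sketch of one way to do that on the client before writing to the cache. `ValueChunker` and the `key:index` naming scheme are assumptions made for illustration, not something Azure Cache for Redis or StackExchange.Redis prescribes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits one large value into fixed-size chunks and
// derives a key per chunk ("<key>:0", "<key>:1", ...). The key scheme is an assumption.
public static class ValueChunker
{
    public static List<KeyValuePair<string, byte[]>> Split(string key, byte[] value, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        var chunks = new List<KeyValuePair<string, byte[]>>();
        for (int offset = 0, index = 0; offset < value.Length; offset += chunkSize, index++)
        {
            // The final chunk may be shorter than chunkSize.
            int length = Math.Min(chunkSize, value.Length - offset);
            var chunk = new byte[length];
            Array.Copy(value, offset, chunk, 0, length);
            chunks.Add(new KeyValuePair<string, byte[]>($"{key}:{index}", chunk));
        }
        return chunks;
    }

    // Reassembles chunks that are supplied in their original order.
    public static byte[] Join(IReadOnlyList<byte[]> chunks)
    {
        int total = 0;
        foreach (var chunk in chunks) total += chunk.Length;
        var result = new byte[total];
        int offset = 0;
        foreach (var chunk in chunks)
        {
            Array.Copy(chunk, 0, result, offset, chunk.Length);
            offset += chunk.Length;
        }
        return result;
    }
}
```

Each chunk can then be written and read as its own small request. A reader needs to know how many chunks exist, so the chunk count would have to be stored alongside the data, for example under a separate key.
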
Additional information