Designing highly available applications using read-access geo-redundant storage

A common feature of cloud-based infrastructures like Azure Storage is that they provide a highly available platform for hosting applications. Developers of cloud-based applications must consider carefully how to use this platform to deliver highly available applications to their users. This article focuses on how developers can use one of Azure's geo-redundant replication options to ensure that their Azure Storage applications are highly available.

Storage accounts configured for geo-redundant replication are synchronously replicated in the primary region, and then asynchronously replicated to a secondary region that is hundreds of miles away. Azure Storage offers two types of geo-redundant replication:

  • Geo-zone-redundant storage (GZRS) (preview) provides replication for scenarios requiring both high availability and maximum durability. Data is replicated synchronously across three Azure availability zones in the primary region using zone-redundant storage (ZRS), then replicated asynchronously to the secondary region. For read access to data in the secondary region, enable read-access geo-zone-redundant storage (RA-GZRS).
  • Geo-redundant storage (GRS) provides cross-regional replication to protect against regional outages. Data is replicated synchronously three times in the primary region using locally redundant storage (LRS), then replicated asynchronously to the secondary region. For read access to data in the secondary region, enable read-access geo-redundant storage (RA-GRS).

This article shows how to design your application to handle an outage in the primary region. If the primary region becomes unavailable, your application can adapt to perform read operations against the secondary region instead. Make sure that your storage account is configured for RA-GRS or RA-GZRS before you get started.

For information about which primary regions are paired with which secondary regions, see Business continuity and disaster recovery (BCDR): Azure Paired Regions.

This article includes code snippets, and a link at the end to a complete sample that you can download and run.

Application design considerations when reading from the secondary

The purpose of this article is to show you how to design an application that will continue to function (albeit in a limited capacity) even in the event of a major disaster at the primary data center. You can design your application to handle transient or long-running issues by reading from the secondary region when there is a problem that interferes with reading from the primary region. When the primary region is available again, your application can return to reading from the primary region.

Keep in mind these key points when designing your application for RA-GRS or RA-GZRS:

  • Azure Storage maintains a read-only copy of the data you store in your primary region in a secondary region. As noted above, the storage service determines the location of the secondary region.

  • The read-only copy is eventually consistent with the data in the primary region.

  • For blobs, tables, and queues, you can query the secondary region for a Last Sync Time value that tells you when the last replication from the primary to the secondary region occurred. (This is not supported for Azure Files, which doesn't have RA-GRS redundancy at this time.)

  • You can use the Storage Client Library to read and write data in the primary region and to read data from the secondary region, which is read-only. You can also redirect read requests automatically to the secondary region if a read request to the primary region times out.

  • If the primary region becomes unavailable, you can initiate an account failover. When you fail over to the secondary region, the DNS entries pointing to the primary region are changed to point to the secondary region. After the failover is complete, write access is restored for GRS and RA-GRS accounts. For more information, see Disaster recovery and storage account failover (preview) in Azure Storage.

Note

Customer-managed account failover (preview) is not yet available in regions supporting GZRS/RA-GZRS, so customers cannot currently manage account failover events with GZRS and RA-GZRS accounts. During the preview, Microsoft will manage any failover events affecting GZRS/RA-GZRS accounts.

Using eventually consistent data

The proposed solution assumes that it is acceptable to return potentially stale data to the calling application. Because data in the secondary region is eventually consistent, the primary region may become inaccessible before an update has finished replicating to the secondary region.

For example, suppose your customer submits an update successfully, but the primary region fails before the update is propagated to the secondary region. When the customer asks to read the data back, they receive the stale data from the secondary region instead of the updated data. When designing your application, you must decide whether this is acceptable, and if so, how you will communicate it to the customer.

Later in this article, we show how to check the Last Sync Time for the secondary data to determine whether the secondary is up to date.

Handling services separately or all together

While unlikely, it is possible for one service to become unavailable while the other services are still fully functional. You can handle the retries and read-only mode for each service separately (blobs, queues, tables), or you can handle retries generically for all the storage services together.

For example, if you use queues and blobs in your application, you may decide to put in separate code to handle retryable errors for each of these. Then if the blob service returns a retryable error but the queue service is still working, only the part of your application that handles blobs will be impacted. If you decide to handle all storage service retries generically and a call to the blob service returns a retryable error, then requests to both the blob service and the queue service will be impacted.

Ultimately, this depends on the complexity of your application. You may decide not to handle the failures by service, but instead to redirect read requests for all storage services to the secondary region and run the application in read-only mode when you detect a problem with any storage service in the primary region.

Other considerations

These are the other considerations we will discuss in the rest of this article.

  • Handling retries of read requests using the Circuit Breaker pattern

  • Eventually consistent data and the Last Sync Time

  • Testing

Running your application in read-only mode

To effectively prepare for an outage in the primary region, you must be able to handle both failed read requests and failed update requests (with update in this case meaning inserts, updates, and deletions). If the primary region fails, read requests can be redirected to the secondary region. However, update requests cannot be redirected to the secondary because the secondary is read-only. For this reason, you need to design your application to run in read-only mode.

For example, you can set a flag that is checked before any update requests are submitted to Azure Storage. When an update request comes through while the flag is set, you can skip it and return an appropriate response to the customer. You may even want to disable certain features altogether until the problem is resolved and notify users that those features are temporarily unavailable.

If you decide to handle errors for each service separately, you will also need to be able to run your application in read-only mode by service. For example, you may have read-only flags for each service that can be enabled and disabled. Then you can handle the flag in the appropriate places in your code.
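
The per-service flags described above can be sketched in a few lines. This is a minimal illustration in Python rather than the article's accompanying sample; the names ReadOnlyFlags, ReadOnlyModeError, and submit_update are hypothetical, and send_update stands in for whatever call your application uses to write to Azure Storage.

```python
from dataclasses import dataclass, field


class ReadOnlyModeError(Exception):
    """Raised when an update is attempted while a service is read-only."""


def _default_flags():
    # One flag per storage service; False means updates are allowed.
    return {"blob": False, "queue": False, "table": False}


@dataclass
class ReadOnlyFlags:
    flags: dict = field(default_factory=_default_flags)

    def set_read_only(self, service: str, read_only: bool) -> None:
        self.flags[service] = read_only

    def is_read_only(self, service: str) -> bool:
        return self.flags[service]


def submit_update(flags: ReadOnlyFlags, service: str, send_update):
    """Check the service's flag first; skip the update in read-only mode."""
    if flags.is_read_only(service):
        raise ReadOnlyModeError(f"{service} updates are temporarily unavailable")
    return send_update()
```

With separate flags, putting the blob service into read-only mode leaves queue and table updates untouched. Handling all services together would collapse the dictionary into a single flag.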

Being able to run your application in read-only mode has another side benefit: it gives you the ability to ensure limited functionality during a major application upgrade. You can trigger your application to run in read-only mode and point to the secondary data center, ensuring nobody is accessing the data in the primary region while you're making upgrades.

Handling updates when running in read-only mode

There are many ways to handle update requests when running in read-only mode. We won't cover this comprehensively, but generally, there are a few patterns that you can consider.

  1. You can respond to your user and tell them you are not currently accepting updates. For example, a contact management system could enable customers to access contact information but not make updates.

  2. You can enqueue your updates in another region. In this case, you would write your pending update requests to a queue in a different region, and then have a way to process those requests after the primary data center comes online again. In this scenario, you should let the customer know that the requested update is queued for later processing.

  3. You can write your updates to a storage account in another region. Then when the primary data center comes back online, you can have a way to merge those updates into the primary data, depending on the structure of the data. For example, if you are creating separate files with a date/time stamp in the name, you can copy those files back to the primary region. This works for some workloads such as logging and IoT data.
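
As a rough illustration of the second pattern, the sketch below holds pending updates in arrival order and replays them once the primary is reachable again. It is an assumption-laden stand-in: PendingUpdateQueue is a hypothetical in-memory class, whereas a real implementation would use a durable queue in a different region, and apply_update represents your own code that writes to the primary.

```python
from collections import deque


class PendingUpdateQueue:
    """Holds update requests received during read-only mode, oldest first."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, update) -> str:
        self._pending.append(update)
        # Message to return to the customer, as the article recommends.
        return "queued for later processing"

    def replay(self, apply_update) -> int:
        """Apply queued updates in order; an update is removed only after it succeeds."""
        applied = 0
        while self._pending:
            apply_update(self._pending[0])  # may raise; the update then stays queued
            self._pending.popleft()
            applied += 1
        return applied

    def __len__(self) -> int:
        return len(self._pending)
```

Removing an entry only after apply_update returns means a failure partway through the replay leaves the remaining updates queued for the next attempt.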

Handling retries

The Azure Storage client library helps you determine which errors can be retried. For example, a 404 error (resource not found) is not retried because retrying it is not likely to result in success. On the other hand, a 500 error can be retried because it is a server error, and it may simply be a transient issue. For more details, check out the open source code for the ExponentialRetry class in the .NET storage client library. (Look for the ShouldRetry method.)
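
A small classifier makes the distinction concrete. The sketch below mirrors the status-code check in the custom retry policy sample later in this article (3xx and 4xx other than 408 are not retried, nor are 501 and 505). It is an illustration in Python, not the client library's actual ShouldRetry code, and the function name is_retryable is hypothetical.

```python
# Server errors that a retry cannot fix.
NON_RETRYABLE_SERVER_CODES = {501, 505}  # Not Implemented, Version Not Supported


def is_retryable(status_code: int) -> bool:
    """Return True when a failed request with this HTTP status is worth retrying."""
    if status_code < 300:
        return False  # success; nothing to retry
    if 300 <= status_code < 500 and status_code != 408:
        return False  # redirects and client errors such as 404 will not succeed on retry
    if status_code in NON_RETRYABLE_SERVER_CODES:
        return False
    return True  # 408 (request timeout) and the remaining 5xx server errors
```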

Read requests

Read requests can be redirected to secondary storage if there is a problem with primary storage. As noted above in Using eventually consistent data, it must be acceptable for your application to potentially read stale data. If you are using the storage client library to access data from the secondary, you can specify the retry behavior of a read request by setting the LocationMode property to one of the following values:

  • PrimaryOnly (the default)

  • PrimaryThenSecondary

  • SecondaryOnly

  • SecondaryThenPrimary

When you set the LocationMode to PrimaryThenSecondary, if the initial read request to the primary endpoint fails with an error that can be retried, the client automatically makes another read request to the secondary endpoint. If the error is a server timeout, then the client will have to wait for the timeout to expire before it receives a retryable error from the service.
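
To make the four modes easier to compare, here is a simplified model of the endpoint order each one implies. It is a sketch only: the real client library also applies a retry policy with back-off between attempts, which is omitted here, and read_with_location_mode, RetryableError, and the two callables are hypothetical names introduced for illustration.

```python
from enum import Enum


class LocationMode(Enum):
    PRIMARY_ONLY = "PrimaryOnly"
    PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
    SECONDARY_ONLY = "SecondaryOnly"
    SECONDARY_THEN_PRIMARY = "SecondaryThenPrimary"


class RetryableError(Exception):
    """Stand-in for any error the client classifies as retryable."""


def read_with_location_mode(mode, read_primary, read_secondary):
    """Try endpoints in the order implied by mode, moving on after a retryable error."""
    attempts = {
        LocationMode.PRIMARY_ONLY: [read_primary],
        LocationMode.PRIMARY_THEN_SECONDARY: [read_primary, read_secondary],
        LocationMode.SECONDARY_ONLY: [read_secondary],
        LocationMode.SECONDARY_THEN_PRIMARY: [read_secondary, read_primary],
    }[mode]
    last_error = None
    for read in attempts:
        try:
            return read()
        except RetryableError as error:
            last_error = error  # remember it and try the next endpoint, if any
    raise last_error
```

In this model, SecondaryOnly never touches the primary endpoint, so no time is spent waiting for a primary timeout before the read is served.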

There are basically two scenarios to consider when you are deciding how to respond to a retryable error:

  • This is an isolated problem and subsequent requests to the primary endpoint will not return a retryable error. An example of where this might happen is when there is a transient network error.

    In this scenario, there is no significant performance penalty in having LocationMode set to PrimaryThenSecondary, because the fallback to the secondary happens only infrequently.

  • This is a problem with at least one of the storage services in the primary region and all subsequent requests to that service in the primary region are likely to return retryable errors for a period of time. An example of this is if the primary region is completely inaccessible.

    In this scenario, there is a performance penalty because all your read requests will try the primary endpoint first, wait for the timeout to expire, then switch to the secondary endpoint.

In the second scenario, you should recognize that there is an ongoing issue with the primary endpoint and send all read requests directly to the secondary endpoint by setting the LocationMode property to SecondaryOnly. At this time, you should also change the application to run in read-only mode. This approach is known as the Circuit Breaker pattern.

Update requests

The Circuit Breaker pattern can also be applied to update requests. However, update requests cannot be redirected to secondary storage, which is read-only. For these requests, you should leave the LocationMode property set to PrimaryOnly (the default). To handle these errors, you can apply a metric to these requests (such as 10 failures in a row), and when your threshold is met, switch the application into read-only mode. You can use the same methods for returning to update mode as those described in the next section about the Circuit Breaker pattern.
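
The "failures in a row" metric can be tracked with a simple counter. The sketch below is one possible shape for it. ConsecutiveFailureBreaker is a hypothetical name, and the default threshold of 10 is just the example figure mentioned above and would normally be a configurable setting.

```python
class ConsecutiveFailureBreaker:
    """Switches to read-only mode after `threshold` failures with no success in between."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.failures = 0
        self.read_only = False

    def record_success(self) -> None:
        # Any success breaks the streak, so the count starts over.
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.read_only = True
```

The application would call record_success or record_failure after each update request and consult read_only before submitting the next one.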

Circuit Breaker pattern

Using the Circuit Breaker pattern in your application can prevent it from retrying an operation that is likely to fail repeatedly. It allows the application to continue to run rather than taking up time while the operation is retried exponentially. It also detects when the fault has been fixed, at which time the application can try the operation again.

How to implement the Circuit Breaker pattern

To identify that there is an ongoing problem with a primary endpoint, you can monitor how frequently the client encounters retryable errors. Because each case is different, you have to decide on the threshold you want to use for the decision to switch to the secondary endpoint and run the application in read-only mode. For example, you could decide to perform the switch if there are 10 failures in a row with no successes. Another example is to switch if 90% of the requests in a 2-minute period fail.

For the first scenario, you can simply keep a count of the failures, and if there is a success before reaching the maximum, set the count back to zero. For the second scenario, one way to implement it is to use the MemoryCache object (in .NET). For each request, add a CacheItem to the cache, set the value to success (1) or fail (0), and set the expiration time to 2 minutes from now (or whatever your time constraint is). When an entry's expiration time is reached, the entry is automatically removed. This will give you a rolling 2-minute window. Each time you make a request to the storage service, you first use a LINQ query across the MemoryCache object to calculate the percent success by summing the values and dividing by the count. When the percent success drops below some threshold (such as 10%), set the LocationMode property for read requests to SecondaryOnly and switch the application into read-only mode before continuing.
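
The rolling-window calculation does not depend on MemoryCache specifically. As a language-neutral illustration, the following Python sketch keeps timestamped 1/0 entries in a deque, drops the ones older than the window, and computes the success ratio from what remains. RollingWindowMonitor and its parameters are hypothetical names, and the clock is injectable only so that the behavior can be tested without waiting two minutes.

```python
import time
from collections import deque


class RollingWindowMonitor:
    """Tracks request outcomes over a rolling time window."""

    def __init__(self, window_seconds=120.0, min_success_ratio=0.10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_success_ratio = min_success_ratio
        self._clock = clock
        self._entries = deque()  # (timestamp, 1 for success or 0 for failure)

    def record(self, success: bool) -> None:
        self._entries.append((self._clock(), 1 if success else 0))

    def _evict_expired(self) -> None:
        cutoff = self._clock() - self.window_seconds
        while self._entries and self._entries[0][0] <= cutoff:
            self._entries.popleft()

    def success_ratio(self):
        """Sum of values divided by count, or None when the window is empty."""
        self._evict_expired()
        if not self._entries:
            return None
        return sum(value for _, value in self._entries) / len(self._entries)

    def should_switch_to_secondary(self) -> bool:
        ratio = self.success_ratio()
        return ratio is not None and ratio < self.min_success_ratio
```

An empty window returns None rather than 0 so that a period with no traffic is not mistaken for a period of total failure. Whether that is the right choice depends on your application.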

The threshold of errors used to determine when to make the switch may vary from service to service in your application, so you should consider making them configurable parameters. This is also where you decide whether to handle retryable errors from each service separately or as one, as discussed previously.

Another consideration is how to handle multiple instances of an application, and what to do when you detect retryable errors in each instance. For example, you may have 20 VMs running with the same application loaded. Do you handle each instance separately? If one instance starts having problems, do you want to limit the response to just that one instance, or do you want to try to have all instances respond in the same way when one instance has a problem? Handling the instances separately is much simpler than trying to coordinate the response across them, but how you do this depends on your application's architecture.

Options for monitoring the error frequency

You have three main options for monitoring the frequency of retries in the primary region in order to determine when to switch over to the secondary region and change the application to run in read-only mode.

  • Add a handler for the Retrying event on the OperationContext object you pass to your storage requests. This is the method shown in this article and used in the accompanying sample. These events fire whenever the client retries a request, enabling you to track how often the client encounters retryable errors on a primary endpoint.

    operationContext.Retrying += (sender, arguments) =>
    {
        // Retrying in the primary region
        if (arguments.Request.Host == primaryhostname)
            ...
    };
    
  • In the Evaluate method in a custom retry policy, you can run custom code whenever a retry takes place. In addition to recording when a retry happens, this also gives you the opportunity to modify your retry behavior.

    public RetryInfo Evaluate(RetryContext retryContext,
    OperationContext operationContext)
    {
        var statusCode = retryContext.LastRequestResult.HttpStatusCode;
        if (retryContext.CurrentRetryCount >= this.maximumAttempts
            || ((statusCode >= 300 && statusCode < 500 && statusCode != 408)
            || statusCode == 501 // Not Implemented
            || statusCode == 505 // Version Not Supported
            ))
        {
            // Do not retry
            return null;
        }
    
        // Monitor retries in the primary location
        ...
    
        // Determine RetryInterval and TargetLocation
        RetryInfo info =
            CreateRetryInfo(retryContext.CurrentRetryCount);
    
        return info;
    }
    
  • The third approach is to implement a custom monitoring component in your application that continually pings your primary storage endpoint with dummy read requests (such as reading a small blob) to determine its health. This would take up some resources, but not a significant amount. When a problem is discovered that reaches your threshold, you would then perform the switch to SecondaryOnly and read-only mode.

At some point, you will want to switch back to using the primary endpoint and allowing updates. If using one of the first two methods listed above, you could simply switch back to the primary endpoint and enable update mode after an arbitrarily selected amount of time or number of operations has been performed. You can then let it go through the retry logic again. If the problem has been fixed, it will continue to use the primary endpoint and allow updates. If there is still a problem, it will once more switch back to the secondary endpoint and read-only mode after failing the criteria you've set.

For the third approach, when pinging the primary storage endpoint becomes successful again, you can trigger the switch back to PrimaryOnly and continue allowing updates.
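
For the time-based switch back described above, a small timer object is enough. The sketch below is one way to express it, with assumptions stated up front: SwitchBackTimer is a hypothetical name, the 300-second cooldown is an arbitrary placeholder, and the clock is injectable purely for testing.

```python
import time


class SwitchBackTimer:
    """Keeps the application on the secondary until a cooldown has elapsed."""

    def __init__(self, cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._tripped_at = None  # None means the primary is in use

    def trip(self) -> None:
        """Call when the error threshold is reached and reads move to the secondary."""
        self._tripped_at = self._clock()

    def use_secondary(self) -> bool:
        if self._tripped_at is None:
            return False
        if self._clock() - self._tripped_at >= self.cooldown_seconds:
            # Cooldown over: go back to the primary and let the retry logic decide again.
            self._tripped_at = None
            return False
        return True
```

If the primary is still unhealthy after the cooldown, the error monitoring would reach its threshold again and call trip() once more, which matches the behavior the text describes.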

Handling eventually consistent data

Geo-redundant storage works by replicating transactions from the primary to the secondary region. This replication process guarantees that the data in the secondary region is eventually consistent. This means that all the transactions in the primary region will eventually appear in the secondary region, but that there may be a lag before they appear, and that there is no guarantee the transactions arrive in the secondary region in the same order in which they were originally applied in the primary region. If your transactions arrive in the secondary region out of order, you may consider your data in the secondary region to be in an inconsistent state until the service catches up.

The following table shows an example of what might happen when you update the details of an employee to make them a member of the administrators role. For the sake of this example, this requires that you update the employee entity and update an administrator role entity with a count of the total number of administrators. Notice how the updates are applied out of order in the secondary region.

Time | Transaction | Replication | Last Sync Time | Result
-----|-------------|-------------|----------------|-------
T0 | Transaction A: Insert employee entity in primary | | | Transaction A inserted to primary, not replicated yet.
T1 | | Transaction A replicated to secondary | T1 | Transaction A replicated to secondary. Last Sync Time updated.
T2 | Transaction B: Update employee entity in primary | | T1 | Transaction B written to primary, not replicated yet.
T3 | Transaction C: Update administrator role entity in primary | | T1 | Transaction C written to primary, not replicated yet.
T4 | | Transaction C replicated to secondary | T1 | Transaction C replicated to secondary. Last Sync Time not updated because transaction B has not been replicated yet.
T5 | Read entities from secondary | | T1 | You get the stale value for the employee entity because transaction B hasn't replicated yet. You get the new value for the administrator role entity because C has replicated. Last Sync Time still hasn't been updated because transaction B hasn't replicated. You can tell the administrator role entity is inconsistent because the entity date/time is after the Last Sync Time.
T6 | | Transaction B replicated to secondary | T6 | All transactions through C have been replicated. Last Sync Time is updated.

In this example, assume the client switches to reading from the secondary region at T5. It can successfully read the administrator role entity at this time, but the entity contains a value for the count of administrators that is not consistent with the number of employee entities that are marked as administrators in the secondary region at this time. Your client could simply display this value, with the risk that it is inconsistent information. Alternatively, the client could attempt to determine that the administrator role is in a potentially inconsistent state because the updates have happened out of order, and then inform the user of this fact.

To recognize that it has potentially inconsistent data, the client can use the value of the Last Sync Time, which you can get at any time by querying a storage service. This tells you the time when the data in the secondary region was last consistent and when the service had applied all the transactions prior to that point in time. In the example shown above, after the service inserts the employee entity in the secondary region, the Last Sync Time is set to T1. It remains at T1 until the service updates the employee entity in the secondary region, when it is set to T6. If the client retrieves the Last Sync Time when it reads the entity at T5, it can compare it with the timestamp on the entity. If the timestamp on the entity is later than the Last Sync Time, then the entity is in a potentially inconsistent state, and you can take whatever is the appropriate action for your application. Using this field requires that you know when the last update to the primary was completed.
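
The comparison itself is a single timestamp check. The sketch below replays the T5 read from the example, using made-up datetimes in place of T0, T1, and T3. The function name is_potentially_inconsistent is hypothetical, and obtaining the real Last Sync Time and entity timestamps from the service is outside the scope of this snippet.

```python
from datetime import datetime, timedelta, timezone


def is_potentially_inconsistent(entity_timestamp: datetime, last_sync_time: datetime) -> bool:
    """True when the entity was written after the secondary was last known to be consistent."""
    return entity_timestamp > last_sync_time


# Hypothetical timestamps standing in for T0, T1, and T3 in the table above.
T0 = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
T1 = T0 + timedelta(minutes=1)
T3 = T0 + timedelta(minutes=3)

last_sync_time = T1               # value read from the secondary at T5
employee_entity_timestamp = T0    # transaction B has not reached the secondary yet
admin_role_entity_timestamp = T3  # transaction C has already replicated
```

With these values the administrator role entity is flagged and the employee entity is not, even though the employee entity is stale. The check only detects entities newer than the Last Sync Time, which is why you also need to know when the last update to the primary was completed.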

Getting the last sync time

You can use PowerShell or Azure CLI to retrieve the last sync time to determine when data was last written to the secondary.

PowerShell

To get the last sync time for the storage account by using PowerShell, install an Azure Storage preview module that supports getting geo-replication stats. For example:

Install-Module Az.Storage -Repository PSGallery -RequiredVersion 1.1.1-preview -AllowPrerelease -AllowClobber -Force

Then check the storage account's GeoReplicationStats.LastSyncTime property. Remember to replace the placeholder values with your own values:

$lastSyncTime = $(Get-AzStorageAccount -ResourceGroupName <resource-group> `
    -Name <storage-account> `
    -IncludeGeoReplicationStats).GeoReplicationStats.LastSyncTime

Azure CLI

To get the last sync time for the storage account by using Azure CLI, check the storage account's geoReplicationStats.lastSyncTime property. Use the --expand parameter to return values for the properties nested under geoReplicationStats. Remember to replace the placeholder values with your own values:

lastSyncTime=$(az storage account show \
    --name <storage-account> \
    --resource-group <resource-group> \
    --expand geoReplicationStats \
    --query geoReplicationStats.lastSyncTime \
    --output tsv)

Testing

It's important to test that your application behaves as expected when it encounters retryable errors. For example, you need to test that the application switches to the secondary and into read-only mode when it detects a problem, and switches back when the primary region becomes available again. To do this, you need a way to simulate retryable errors and control how often they occur.

You can use Fiddler to intercept and modify HTTP responses in a script. This script can identify responses that come from your primary endpoint and change the HTTP status code to one that the Storage Client Library recognizes as a retryable error. This code snippet shows a simple example of a Fiddler script that intercepts responses to read requests against the employeedata table and returns a 502 status:

static function OnBeforeResponse(oSession: Session) {
    ...
    if ((oSession.hostname == "[yourstorageaccount].table.core.windows.net")
      && (oSession.PathAndQuery.StartsWith("/employeedata?$filter"))) {
        oSession.responseCode = 502;
    }
}

You could extend this example to intercept a wider range of requests and only change the responseCode on some of them to better simulate a real-world scenario. For more information about customizing Fiddler scripts, see Modifying a Request or Response in the Fiddler documentation.

If you have made the thresholds for switching your application to read-only mode configurable, it will be easier to test the behavior with non-production transaction volumes.
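
If intercepting HTTP traffic is not practical in a unit test, one alternative is to inject faults in code. The sketch below wraps a read function so that a configurable fraction of calls raises a simulated retryable error. This is an assumption about how you might structure your own tests, not a feature of Fiddler or of the storage client library, and make_flaky and RetryableError are hypothetical names.

```python
import random


class RetryableError(Exception):
    """Stand-in for an error the client would classify as retryable, such as a 502."""


def make_flaky(read, failure_rate: float, seed: int = 0):
    """Wrap `read` so that roughly `failure_rate` of calls raise RetryableError."""
    rng = random.Random(seed)  # seeded so that test runs are repeatable

    def flaky_read(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RetryableError("simulated 502 from the primary endpoint")
        return read(*args, **kwargs)

    return flaky_read
```

Setting failure_rate to 1.0 simulates a complete primary outage, and lowering it back to 0.0 simulates recovery, which lets a test walk through both the switch to read-only mode and the switch back.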

Next steps