
Azure HDInsight Accelerated Writes for Apache HBase

This article provides background on the Accelerated Writes feature for Apache HBase in Azure HDInsight, and how it can be used effectively to improve write performance. Accelerated Writes uses Azure premium SSD managed disks to improve performance of the Apache HBase Write Ahead Log (WAL). To learn more about Apache HBase, see What is Apache HBase in HDInsight.

Overview of HBase architecture

In HBase, a row consists of one or more columns and is identified by a row key. Multiple rows make up a table. Columns contain cells, which are timestamped versions of the value in that column. Columns are grouped into column families, and all columns in a column family are stored together in storage files called HFiles.
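As a small, hypothetical illustration of this data model, the following HBase shell commands create a table with a single column family, write one cell, and read it back. The table, column family, and column names are made up for the example; the get output lists each cell as column, timestamp, and value, which is the timestamped version described above.

create 'mytable', 'cf1'
put 'mytable', 'row1', 'cf1:col1', 'value1'
get 'mytable', 'row1'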

Regions in HBase are used to balance the data processing load. HBase first stores the rows of a table in a single region. The rows are spread across multiple regions as the amount of data in the table increases. Region Servers can handle requests for multiple regions.
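For example, a table can also be pre-split at creation time so that its rows are distributed across several regions from the start. The table name and split keys below are illustrative only; on newer HBase shells, list_regions 'mytable' then shows the resulting regions and the Region Servers that host them.

create 'mytable', 'cf1', SPLITS => ['g', 'm', 't']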

Write Ahead Log for Apache HBase

HBase first writes data updates to a type of commit log called a Write Ahead Log (WAL). After the update is stored in the WAL, it's written to the in-memory MemStore. When the data in memory reaches its maximum capacity, it's written to disk as an HFile.
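You can observe this sequence from the HBase shell with a hypothetical table: the put is recorded in the WAL and the MemStore, and flush forces the in-memory data to be written out as an HFile without waiting for the MemStore to fill up.

put 'mytable', 'row2', 'cf1:col1', 'value2'
flush 'mytable'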

If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the Write Ahead Log can be used to replay updates. Without the WAL, if a RegionServer crashes before flushing updates to an HFile, all of those updates are lost.

Accelerated Writes feature in Azure HDInsight for Apache HBase

The Accelerated Writes feature solves the problem of higher write latencies caused by using Write Ahead Logs that are in cloud storage. The Accelerated Writes feature for HDInsight Apache HBase clusters attaches premium SSD managed disks to every RegionServer (worker node). Write Ahead Logs are then written to the Hadoop Distributed File System (HDFS) mounted on these premium managed disks instead of to cloud storage. Premium managed disks use solid-state drives (SSDs) and offer excellent I/O performance with fault tolerance. Unlike unmanaged disks, if one storage unit goes down, it won't affect other storage units in the same availability set. As a result, managed disks provide low write latency and better resiliency for your applications. To learn more about Azure managed disks, see Introduction to Azure managed disks.
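To confirm where the WALs land on a cluster with Accelerated Writes, one option is to list the WAL directory on the local HDFS. The path below is only an assumption for illustration; check the hbase.wal.dir setting in your cluster's HBase configuration (for example, through Ambari) for the actual location.

hdfs dfs -ls hdfs://mycluster/hbasewal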

How to enable Accelerated Writes for HBase in HDInsight

To create a new HBase cluster with the Accelerated Writes feature, follow the steps in Set up clusters in HDInsight until you reach Step 3, Storage. Under Metastore Settings, select the checkbox next to Enable HBase accelerated writes. Then, continue with the remaining steps for cluster creation.
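The portal flow above is the documented path. If you script cluster creation instead, a rough Azure CLI sketch is shown below; the parameter names and values (notably --workernode-data-disks-per-node, which attaches the managed disks used for Accelerated Writes) are assumptions to verify against the current az hdinsight create reference before use.

az hdinsight create \
    --name myhbasecluster \
    --resource-group myresourcegroup \
    --type hbase \
    --http-password 'MyPassword1!' \
    --storage-account mystorageaccount \
    --workernode-data-disks-per-node 1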

Screenshot: Enable accelerated writes option for HDInsight Apache HBase

Other considerations

To preserve data durability, create a cluster with a minimum of three worker nodes. Once created, you can't scale the cluster down to fewer than three worker nodes.

Flush or disable your HBase tables before deleting the cluster, so that you don't lose Write Ahead Log data.

flush 'mytable'
disable 'mytable'
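If the cluster has many tables, you can repeat those commands for each table, or disable every table at once with a regular expression. The pattern below matches all user tables and is only a sketch; disabling a table also flushes its in-memory data, but confirm the pattern doesn't catch tables an application still needs.

disable_all '.*'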

Follow similar steps when scaling down your cluster: flush your tables and disable your tables to stop incoming data. You can't scale your cluster down to fewer than three nodes.

Following these steps will ensure a successful scale-down and avoid the possibility of a namenode going into safe mode because of under-replicated or temporary files.

If your namenode does go into safe mode after a scale-down, use hdfs commands to re-replicate the under-replicated blocks and get HDFS out of safe mode. This re-replication will allow you to restart HBase successfully.
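A minimal sketch of those hdfs commands is shown below. The target path and the replication factor of 3 are assumptions; take the actual paths from the fsck report and use your cluster's configured replication factor.

hdfs dfsadmin -safemode get                    # check whether the namenode is in safe mode
hdfs fsck /                                    # report file-system health, including under-replicated blocks
hdfs dfs -setrep -w 3 /path/from/fsck/report   # raise replication on the affected paths and wait
hdfs dfsadmin -safemode leave                  # exit safe mode once the blocks are healthy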

Next steps