
Azure HDInsight Accelerated Writes for Apache HBase

This article provides background on the Accelerated Writes feature for Apache HBase in Azure HDInsight and explains how to use it to improve write performance. Accelerated Writes uses Azure premium SSD managed disks to improve the performance of the Apache HBase Write Ahead Log (WAL). To learn more about Apache HBase, see What is Apache HBase in HDInsight.

Overview of HBase architecture

In HBase, a row consists of one or more columns and is identified by a row key. Multiple rows make up a table. Columns contain cells, which are timestamped versions of the value in that column. Columns are grouped into column families, and all columns in a column family are stored together in storage files called HFiles.
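The data model described above can be sketched as nested mappings. This is a toy illustration only, not HBase's storage format or client API; `ToyTable` and its methods are hypothetical names introduced for this example.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> "family:qualifier" -> list of (timestamp, value) cells."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value, timestamp):
        column = f"{family}:{qualifier}"
        cells = self.rows[row_key].setdefault(column, [])
        cells.append((timestamp, value))
        # Keep the newest version first, ordered by timestamp only.
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row_key, family, qualifier, versions=1):
        column = f"{family}:{qualifier}"
        return self.rows.get(row_key, {}).get(column, [])[:versions]


table = ToyTable()
table.put("user1", "info", "email", "old@example.com", timestamp=1)
table.put("user1", "info", "email", "new@example.com", timestamp=2)

latest = table.get("user1", "info", "email")
history = table.get("user1", "info", "email", versions=2)
```

Here `latest` holds only the newest cell, while `history` returns both timestamped versions of the same column.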

Regions in HBase are used to balance the data processing load. HBase first stores the rows of a table in a single region. As the amount of data in the table increases, the rows are spread across multiple regions. Region Servers can handle requests for multiple regions.
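The way a table grows from one region into several can be illustrated with a toy split routine. This is a simplified sketch: it splits on row count at the median key, which is an assumption made for brevity and not HBase's actual split policy. `ToyRegion` and `split_if_needed` are hypothetical names.

```python
class ToyRegion:
    """A contiguous range of row keys beginning at start_key."""

    def __init__(self, start_key=""):
        self.start_key = start_key
        self.rows = {}


def split_if_needed(regions, max_rows):
    """Split any region holding more than max_rows rows at its median key."""
    result = []
    for region in regions:
        if len(region.rows) <= max_rows:
            result.append(region)
            continue
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        left, right = ToyRegion(region.start_key), ToyRegion(mid)
        for key in keys:
            target = left if key < mid else right
            target.rows[key] = region.rows[key]
        result.extend([left, right])
    return result


first = ToyRegion()
for key in ["row-a", "row-b", "row-c", "row-d"]:
    first.rows[key] = "value"

regions = split_if_needed([first], max_rows=3)
```

After the call, the single starting region has been replaced by two regions that each cover half of the key range.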

Write Ahead Log for Apache HBase

HBase first writes data updates to a type of commit log called a Write Ahead Log (WAL). After the update is stored in the WAL, it's written to the in-memory MemStore. When the data in memory reaches its maximum capacity, it's written to disk as an HFile.

If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the Write Ahead Log can be used to replay the updates. Without the WAL, if a RegionServer crashes before flushing updates to an HFile, all of those updates are lost.
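The write path and crash recovery described above can be sketched in a few lines. This is a toy model under stated assumptions: the WAL is a plain Python list standing in for durable storage, the flush threshold counts entries and not bytes, and `ToyRegionServer` is a hypothetical name, not an HBase class.

```python
class ToyRegionServer:
    def __init__(self, wal, flush_threshold=3):
        self.wal = wal              # stands in for the durable log
        self.memstore = {}          # in-memory buffer, lost on a crash
        self.hfiles = []            # flushed, immutable snapshots
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. record the update in the WAL first
        self.memstore[key] = value      # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.hfiles.append(dict(self.memstore))
        self.memstore.clear()
        self.wal.clear()                # flushed updates no longer need replay

    @classmethod
    def recover(cls, wal, hfiles, flush_threshold=3):
        """Rebuild the MemStore by replaying the surviving WAL entries."""
        server = cls(wal, flush_threshold)
        server.hfiles = hfiles
        for key, value in wal:
            server.memstore[key] = value
        return server


wal = []
server = ToyRegionServer(wal, flush_threshold=3)
server.put("r1", "a")
server.put("r2", "b")

# Simulated crash: the MemStore is gone, but the WAL and HFiles survive.
recovered = ToyRegionServer.recover(wal, server.hfiles)
```

Because both updates were logged before the simulated crash, `recovered.memstore` contains them again after the replay. Without the `wal.append` step there would be nothing to replay.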

Accelerated Writes feature in Azure HDInsight for Apache HBase

The Accelerated Writes feature solves the problem of higher write latencies caused by keeping Write Ahead Logs in cloud storage. The Accelerated Writes feature for HDInsight Apache HBase clusters attaches premium SSD managed disks to every RegionServer (worker node). Write Ahead Logs are then written to the Hadoop File System (HDFS) mounted on these premium managed disks instead of cloud storage. Premium managed disks use Solid-State Disks (SSDs) and offer excellent I/O performance with fault tolerance. Unlike unmanaged disks, if one storage unit goes down, it won't affect other storage units in the same availability set. As a result, managed disks provide low write latency and better resiliency for your applications. To learn more about Azure managed disks, see Introduction to Azure managed disks.

How to enable Accelerated Writes for HBase in HDInsight

To create a new HBase cluster with the Accelerated Writes feature, follow the steps in Set up clusters in HDInsight until you reach Step 3, Storage. Under Metastore Settings, select the checkbox next to Enable HBase accelerated writes. Then, continue with the remaining steps for cluster creation.

[Image: Enable accelerated writes option for HDInsight Apache HBase]

Other considerations

To preserve data durability, create a cluster with a minimum of three worker nodes. Once created, you can't scale down the cluster to fewer than three worker nodes.

Flush or disable your HBase tables before deleting the cluster, so that you don't lose Write Ahead Log data.

flush 'mytable'
disable 'mytable'

Follow similar steps when scaling down your cluster: flush your tables, then disable them to stop incoming data. You can't scale down your cluster to fewer than three nodes.

Following these steps helps the scale-down succeed and avoids the possibility of a NameNode going into safe mode because of under-replicated or temporary files.

If your NameNode does go into safe mode after a scale-down, use hdfs commands to re-replicate the under-replicated blocks and get HDFS out of safe mode. This re-replication will allow you to restart HBase successfully.
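One possible way to script this recovery is sketched below. It wraps the standard `hdfs dfsadmin -safemode get`, `hdfs dfsadmin -safemode leave`, and `hdfs dfs -setrep -w` commands. The target path `/` and the replication factor of 3 are assumptions, not values stated in this article, and the function names are hypothetical. Check the cluster's actual configuration before running anything like this.

```python
import subprocess


def safe_mode_is_on(report: str) -> bool:
    """Interpret the text printed by `hdfs dfsadmin -safemode get`."""
    return "safe mode is on" in report.lower()


def run_hdfs(*args: str) -> str:
    """Run an hdfs CLI command and return its standard output."""
    done = subprocess.run(
        ["hdfs", *args], capture_output=True, text=True, check=True
    )
    return done.stdout


def recover_from_safe_mode(path: str = "/", replication: int = 3) -> None:
    """Leave safe mode if needed, then wait for blocks to re-replicate.

    Assumptions: `path` and `replication` are illustrative defaults.
    """
    if not safe_mode_is_on(run_hdfs("dfsadmin", "-safemode", "get")):
        return
    # Safe mode blocks namespace changes, so leave it before setrep.
    run_hdfs("dfsadmin", "-safemode", "leave")
    # -w waits until the requested replication factor is reached.
    run_hdfs("dfs", "-setrep", "-w", str(replication), path)
```

Calling `recover_from_safe_mode()` on a cluster node would run the three commands in order. Only `safe_mode_is_on` can be exercised without a cluster.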

Next steps