您现在访问的是微软AZURE全球版技术文档网站,若需要访问由世纪互联运营的MICROSOFT AZURE中国区技术文档网站,请访问 https://docs.azure.cn.

Apache Phoenix 性能最佳做法Apache Phoenix performance best practices

Apache Phoenix 性能的最重要方面是优化基础 Apache HBaseThe most important aspect of Apache Phoenix performance is to optimize the underlying Apache HBase. Phoenix 在 HBase 的顶层创建一个关系数据模型,用于将 SQL 查询转换为 HBase 操作,例如扫描。Phoenix creates a relational data model atop HBase that converts SQL queries into HBase operations, such as scans. 表架构的设计、主键中字段的选择和排序,以及索引的用法都会影响 Phoenix 的性能。The design of your table schema, the selection and ordering of the fields in your primary key, and your use of indexes all affect Phoenix performance.

表架构设计Table schema design

在 Phoenix 中创建一个表时,该表将存储在 HBase 表中。When you create a table in Phoenix, that table is stored in an HBase table. HBase 表包含可统一访问的列组(列系列)。The HBase table contains groups of columns (column families) that are accessed together. Phoenix 表中的行是 HBase 表中的行,其中每个行由版本受控的单元格构成,这些单元格与一个或多个列相关联。A row in the Phoenix table is a row in the HBase table, where each row consists of versioned cells associated with one or more columns. 在逻辑上,单个 HBase 行是键值对的集合,这些键值对具有相同的行键值。Logically, a single HBase row is a collection of key-value pairs, each having the same rowkey value. 这就是说,每个键值对具有一个行键属性,特定行的该行键属性值是相同的。That is, each key-value pair has a rowkey attribute, and the value of that rowkey attribute is the same for a particular row.

Phoenix 表的架构设计包括主键设计、列系列设计、单个列的设计,以及数据分区方式。The schema design of a Phoenix table includes the primary key design, column family design, individual column design, and how the data is partitioned.

主键设计Primary key design

在 Phoenix 中的表上定义的主键确定如何将数据存储在基础 HBase 表的行键中。The primary key defined on a table in Phoenix determines how data is stored within the rowkey of the underlying HBase table. 在 HBase 中,访问特定行的唯一方法就是使用行键。In HBase, the only way to access a particular row is with the rowkey. 此外,存储在 HBase 表中的数据已按行键排序。In addition, data stored in an HBase table is sorted by the rowkey. Phoenix 通过将行中每个列的值与主键中定义的顺序连接来生成 rowkey 值。Phoenix builds the rowkey value by concatenating the values of each of the columns in the row, in the order they're defined in the primary key.

例如,联系人表包含名字、姓氏、电话号码和地址,所有这些数据都包含在同一个列系列中。For example, a table for contacts has the first name, last name, phone number, and address, all in the same column family. 可以基于不断递增的序列号定义主键:You could define a primary key based on an increasing sequence number:

rowkeyrowkey addressaddress phonephone firstNamefirstName lastNamelastName
10001000 1111 San Gabriel Dr.1111 San Gabriel Dr. 1-425-000-00021-425-000-0002 JohnJohn DoleDole
83968396 5415 San Gabriel Dr.5415 San Gabriel Dr. 1-230-555-01911-230-555-0191 CalvinCalvin RajiRaji

但是,如果经常按 lastName 执行查询,则使用此主键可能性能不佳,因为每个查询需要扫描整个表才能读取每个 lastName 的值。However, if you frequently query by lastName this primary key may not perform well, because each query requires a full table scan to read the value of every lastName. 此时,可以基于 lastName、firstName 和社会安全号码列定义主键。Instead, you can define a primary key on the lastName, firstName, and social security number columns. 最后一列用于区分位于同一地址、使用相同姓名的两位居民(例如父亲和儿子)。This last column is to disambiguate two residents at the same address with the same name, such as a father and son.

rowkeyrowkey addressaddress phonephone firstNamefirstName lastNamelastName socialSecurityNumsocialSecurityNum
10001000 1111 San Gabriel Dr.1111 San Gabriel Dr. 1-425-000-00021-425-000-0002 JohnJohn DoleDole 111111
83968396 5415 San Gabriel Dr.5415 San Gabriel Dr. 1-230-555-01911-230-555-0191 CalvinCalvin RajiRaji 222222

Phoenix 使用此新主键生成的行键是:With this new primary key the row keys generated by Phoenix would be:

rowkeyrowkey addressaddress phonephone firstNamefirstName lastNamelastName socialSecurityNumsocialSecurityNum
Dole-John-111Dole-John-111 1111 San Gabriel Dr.1111 San Gabriel Dr. 1-425-000-00021-425-000-0002 JohnJohn DoleDole 111111
Raji-Calvin-222Raji-Calvin-222 5415 San Gabriel Dr.5415 San Gabriel Dr. 1-230-555-01911-230-555-0191 CalvinCalvin RajiRaji 222222

在上面的第一行中,行键的数据按如下方式表示:In the first row above, the data for the rowkey is represented as shown:

rowkeyrowkey keykey value
Dole-John-111Dole-John-111 addressaddress 1111 San Gabriel Dr.1111 San Gabriel Dr.
Dole-John-111Dole-John-111 phonephone 1-425-000-00021-425-000-0002
Dole-John-111Dole-John-111 firstNamefirstName JohnJohn
Dole-John-111Dole-John-111 lastNamelastName DoleDole
Dole-John-111Dole-John-111 socialSecurityNumsocialSecurityNum 111111

现在,此行键存储了数据的重复副本。This rowkey now stores a duplicate copy of the data. 请考虑要包含在主键中的列大小和数目,因为此值将与基础 HBase 表中的每个单元格包含在一起。Consider the size and number of columns you include in your primary key, because this value is included with every cell in the underlying HBase table.

此外,如果主键包含单调递增的值,则应使用盐桶创建表,以帮助避免产生写入热点 - 请参阅将分区数据Also, if the primary key has values that are monotonically increasing, you should create the table with salt buckets to help avoid creating write hotspots - see Partition data.

列系列设计Column family design

如果某些列的访问频率比其他列更高,应创建多个列系列,将经常访问的列与极少访问列区分开来。If some columns are accessed more frequently than others, you should create multiple column families to separate the frequently accessed columns from rarely accessed columns.

此外,如果某些列往往是一起访问的,可将这些列放在同一个列系列中。Also, if certain columns tend to be accessed together, put those columns in the same column family.

列设计Column design

  • 由于大列的 i/o 开销,导致 VARCHAR 列小于约 1 MB。Keep VARCHAR columns under about 1 MB because of the I/O costs of large columns. 处理查询时,HBase 会将单元格作为一个整体具体化,然后将其发送到客户端。客户端会作为一个整体接收这些单元格,然后将其转交到应用程序代码。When processing queries, HBase materializes cells in full before sending them over to the client, and the client receives them in full before handing them off to the application code.
  • 使用 protobuf、Avro、msgpack 或 BSON 等紧凑格式存储列值。Store column values using a compact format such as protobuf, Avro, msgpack, or BSON. 不建议使用 JSON,因为它更大。JSON isn't recommended, as it's larger.
  • 在存储之前考虑压缩数据,以降低延迟和 I/O 开销。Consider compressing data before storage to cut latency and I/O costs.

将数据分区Partition data

使用 Phoenix 可以控制数据分发到的区域数目,从而大幅提高读/写性能。Phoenix enables you to control the number of regions where your data is distributed, which can significantly increase read/write performance. 创建 Phoenix 表时,可以将数据加盐或预先拆分。When creating a Phoenix table, you can either salt or pre-split your data.

若要在创建过程中给表加盐,请指定盐桶数目:To salt a table during creation, specify the number of salt buckets:

CREATE TABLE CONTACTS (...) SALT_BUCKETS = 16

此加盐过程将连同主键值一起拆分表,并自动选择值。This salting splits the table along the values of primary keys, choosing the values automatically.

若要控制表的拆分位置,可以通过提供拆分所要遵循的范围值,来预先拆分表。To control where the table splits occur, you can pre-split the table by providing the range values along which the splitting occurs. 例如,若要创建一个沿着三个区域拆分的表:For example, to create a table split along three regions:

CREATE TABLE CONTACTS (...) SPLIT ON ('CS','EU','NA')

索引设计Index design

Phoenix 索引是一个 HBase 表,存储索引表中的部分或全部数据的副本。A Phoenix index is an HBase table that stores a copy of some or all of the data from the indexed table. 索引可以提高特定类型的查询的性能。An index improves performance for specific types of queries.

如果定义多个索引后查询表,Phoenix 会自动选择查询的最佳索引。When you have multiple indexes defined and then query a table, Phoenix automatically selects the best index for the query. 主索引是根据所选的主键自动创建的。The primary index is created automatically based on the primary keys you select.

对于预见性查询,还可以通过指定查询的列来创建辅助索引。For anticipated queries, you can also create secondary indexes by specifying their columns.

设计索引时:When designing your indexes:

  • 只创建所需的索引。Only create the indexes you need.
  • 限制频繁更新表的索引数。Limit the number of indexes on frequently updated tables. 对的表更新将解释为同时写入主表和索引表。Updates to a table translate into writes to both the main table and the index tables.

创建辅助索引Create secondary indexes

辅助索引将完整表扫描转化为点查找,因此可以提高读取性能,代价是消耗更多的存储空间和降低写入速度。Secondary indexes can improve read performance by turning what would be a full table scan into a point lookup, at the cost of storage space and write speed. 可以在创建表之后添加或删除辅助索引,它们不需要对现有查询进行更改 – 查询速度只会更快。Secondary indexes can be added or removed after table creation and don’t require changes to existing queries – queries just run faster. 请考虑根据需要创建涵盖索引和/或功能索引。Depending on your needs, consider creating covered indexes, functional indexes, or both.

使用涵盖索引Use covered indexes

涵盖索引是包含行中的数据以及已编制索引的值的索引。Covered indexes are indexes that include data from the row in addition to the values that are indexed. 查找所需的索引条目后,无需访问主表。After finding the desired index entry, there's no need to access the primary table.

例如,在示例联系人表中,可以只是基于 socialSecurityNum 列创建辅助索引。For example, in the example contact table you could create a secondary index on just the socialSecurityNum column. 此辅助索引可以加速按 socialSecurityNum 值执行筛选的查询,但检索其他字段值需要针对主表执行另一次读取。This secondary index would speed up queries that filter by socialSecurityNum values, but retrieving other field values will require another read against the main table.

rowkeyrowkey addressaddress phonephone firstNamefirstName lastNamelastName socialSecurityNumsocialSecurityNum
Dole-John-111Dole-John-111 1111 San Gabriel Dr.1111 San Gabriel Dr. 1-425-000-00021-425-000-0002 JohnJohn DoleDole 111111
Raji-Calvin-222Raji-Calvin-222 5415 San Gabriel Dr.5415 San Gabriel Dr. 1-230-555-01911-230-555-0191 CalvinCalvin RajiRaji 222222

但是,在指定 socialSecurityNum 的情况下,如果你往往还要查找 firstName 和 lastName,则可以创建一个涵盖索引,并在其中包含 firstName 和 lastName 作为索引表中的实际数据:However, if you typically want to look up the firstName and lastName given the socialSecurityNum, you could create a covered index that includes the firstName and lastName as actual data in the index table:

CREATE INDEX ssn_idx ON CONTACTS (socialSecurityNum) INCLUDE(firstName, lastName);

通过此涵盖索引,以下查询只需从包含辅助索引的表中读取数据,即可获取所有数据:This covered index enables the following query to acquire all data just by reading from the table containing the secondary index:

SELECT socialSecurityNum, firstName, lastName FROM CONTACTS WHERE socialSecurityNum > 100;

使用功能索引Use functional indexes

使用功能索引可以基于预期要在查询中使用的任意表达式创建索引。Functional indexes allow you to create an index on an arbitrary expression that you expect to be used in queries. 创建功能索引以及使用该表达式的查询后,该索引可用于检索结果而不是数据表。Once you have a functional index in place and a query uses that expression, the index may be used to retrieve the results rather than the data table.

例如,可以创建一个索引,以便根据某人的名字和姓氏组合来执行不区分大小写的搜索:For example, you could create an index to allow you to do case-insensitive searches on the combined first name and last name of a person:

 CREATE INDEX FULLNAME_UPPER_IDX ON "Contacts" (UPPER("firstName"||' '||"lastName"));

查询设计Query design

查询设计的主要考虑因素是:The main considerations in query design are:

  • 了解查询计划并验证其预期行为。Understand the query plan and verify its expected behavior.
  • 有效联接。Join efficiently.

了解查询计划Understand the query plan

SQLLine 中,依次使用 EXPLAIN 和 SQL 查询来查看 Phoenix 将要执行的操作计划。In SQLLine, use EXPLAIN followed by your SQL query to view the plan of operations that Phoenix will perform. 检查该计划:Check that the plan:

  • 是否在适当的情况下使用主键。Uses your primary key when appropriate.
  • 是否使用适当的辅助索引而不是数据表。Uses appropriate secondary indexes, rather than the data table.
  • 是否尽量 RANGE SCAN 或 SKIP SCAN,而不是 TABLE SCAN。Uses RANGE SCAN or SKIP SCAN whenever possible, rather than TABLE SCAN.

计划示例Plan examples

举个例子,假设有一个名为 FLIGHTS 的表,其中存储了航班延迟信息。As an example, say you have a table called FLIGHTS that stores flight delay information.

若要选择 airlineid 为 19805的所有航班,其中 airlineid 是不在主键或任何索引中的字段:To select all the flights with an airlineid of 19805, where airlineid is a field that isn't in the primary key or in any index:

select * from "FLIGHTS" where airlineid = '19805';

按如下所示运行 explain 命令:Run the explain command as follows:

explain select * from "FLIGHTS" where airlineid = '19805';

查询计划如下所示:The query plan looks like this:

CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER FLIGHTS
    SERVER FILTER BY AIRLINEID = '19805'

在此计划中,请注意短语 FULL SCAN OVER FLIGHTS。In this plan, note the phrase FULL SCAN OVER FLIGHTS. 此短语表示针对表中的所有行执行了 TABLE SCAN,而没有使用更高效的 RANGE SCAN 或 SKIP SCAN。This phrase indicates the execution does a TABLE SCAN over all rows in the table, rather than using the more efficient RANGE SCAN or SKIP SCAN option.

现在,假设你要查询 carrier(航空公司)AA 在 2014 年 1 月 2 日 flightnum(航班号)大于 1 的航班。Now, say you want to query for flights on January 2, 2014 for the carrier AA where its flightnum was greater than 1. 假设 year、month、dayofmonth、carrier 和 flightnum 列在示例表中存在,并且都包含在复合主键中。Let's assume that the columns year, month, dayofmonth, carrier, and flightnum exist in the example table, and are all part of the composite primary key. 查询如下所示:The query would look as follows:

select * from "FLIGHTS" where year = 2014 and month = 1 and dayofmonth = 2 and carrier = 'AA' and flightnum > 1;

使用以下代码检查此查询的计划:Let's examine the plan for this query with:

explain select * from "FLIGHTS" where year = 2014 and month = 1 and dayofmonth = 2 and carrier = 'AA' and flightnum > 1;

生成的计划为:The resulting plan is:

CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER FLIGHTS [2014,1,2,'AA',2] - [2014,1,2,'AA',*]

方括号中的值是主键的值范围。The values in square brackets are the range of values for the primary keys. 在本例中,范围值是固定的,即年份 2014、月份 1 和月份日期 2,但允许航班号 2 和更大的值 (*)。In this case, the range values are fixed with year 2014, month 1, and dayofmonth 2, but allow values for flightnum starting with 2 and on up (*). 此查询计划确认已按预期使用主键。This query plan confirms that the primary key is being used as expected.

接下来,基于名为 carrier2_idx 的 FLIGHTS 表创建一个索引,该索引只出现在 carrier 字段中。Next, create an index on the FLIGHTS table named carrier2_idx that is on the carrier field only. 此索引还包含 flightdate、tailnum、origin 和 flightnum 作为涵盖列,这些列的数据也存储在索引中。This index also includes flightdate, tailnum, origin, and flightnum as covered columns whose data is also stored in the index.

CREATE INDEX carrier2_idx ON FLIGHTS (carrier) INCLUDE(FLIGHTDATE,TAILNUM,ORIGIN,FLIGHTNUM);

假设你要获取航空公司以及航日期和机尾编号,如以下查询所示:Say you want to get the carrier along with the flightdate and tailnum, as in the following query:

explain select carrier,flightdate,tailnum from "FLIGHTS" where carrier = 'AA';

应会看到使用了此索引:You should see this index being used:

CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER CARRIER2_IDX ['AA']

有关 explain 计划结果中可能显示的完整项列表,请参阅 Apache Phoenix 优化指南中的“Explain 计划”部分。For a complete listing of the items that can appear in explain plan results, see the Explain Plans section in the Apache Phoenix Tuning Guide.

有效联接Join efficiently

一般而言,除非一侧较小,否则应避免联接,尤其是针对频繁查询。Generally, you want to avoid joins unless one side is small, especially on frequent queries.

如果需要,可以结合 /*+ USE_SORT_MERGE_JOIN */ 提示执行大型联接,但是,针对大量的行执行大型联接会产生很高的开销。If necessary, you can do large joins with the /*+ USE_SORT_MERGE_JOIN */ hint, but a large join is an expensive operation over huge numbers of rows. 如果所有右侧表的总体大小超过了可用内存,请使用 /*+ NO_STAR_JOIN */ 提示。If the overall size of all right-hand-side tables would exceed the available memory, use the /*+ NO_STAR_JOIN */ hint.

方案Scenarios

以下指导原则描述了一些常用模式。The following guidelines describe some common patterns.

读取密集型工作负荷Read-heavy workloads

对于读取繁重用例,请确保使用的是索引。For read-heavy use cases, make sure you're using indexes. 此外,为了节省读取时间开销,请考虑创建涵盖索引。Additionally, to save read-time overhead, consider creating covered indexes.

写入密集型工作负荷Write-heavy workloads

对于主键单调递增的写入繁重的工作负荷,请创建 salt 存储桶,以帮助避免写入热点,因为需要进行额外的扫描,所以需要支付总体读取吞吐量。For write-heavy workloads where the primary key is monotonically increasing, create salt buckets to help avoid write hotspots, at the expense of overall read throughput because of the additional scans needed. 此外,在使用 UPSERT 写入大量记录时,请关闭 autoCommit 并批处理记录。Also, when using UPSERT to write a large number of records, turn off autoCommit and batch up the records.

批量删除Bulk deletes

删除大型数据集时,在发出删除查询之前打开自动提交,使客户端无需记住所有已删除行的行键。When deleting a large data set, turn on autoCommit before issuing the DELETE query, so that the client doesn't need to remember the row keys for all deleted rows. AutoCommit 会阻止客户端缓冲受 DELETE 影响的行,因此,Phoenix 可以直接在区域服务器上删除这些行,且无需将其返回到客户端。AutoCommit prevents the client from buffering the rows affected by the DELETE, so that Phoenix can delete them directly on the region servers without the expense of returning them to the client.

不可变和仅限追加Immutable and Append-only

如果方案更看重写入速度而不是数据完整性,请考虑在创建表时禁用预写日志:If your scenario favors write speed over data integrity, consider disabling the write-ahead log when creating your tables:

CREATE TABLE CONTACTS (...) DISABLE_WAL=true;

有关此选项和其他选项的详细信息,请参阅 Apache Phoenix 语法For details on this and other options, see Apache Phoenix Grammar.

后续步骤Next steps