GROUP BY 子句

GROUP BY子句用于基于一组指定的分组表达式对行进行分组,并基于一个或多个指定的聚合函数对行组计算聚合。 Databricks Runtime 还支持高级聚合,以便通过 GROUPING SETS 、、子句为同一输入记录集执行多个聚合 CUBEROLLUP 。 分组表达式和高级聚合可以在子句中混合 GROUP BY 并嵌套在 GROUPING SETS 子句中。

请参阅混合/嵌套分组分析部分中的更多详细信息。

FILTER 子句附加到聚合函数时,只会将匹配的行传递给该函数。

语法

GROUP BY group_expression [, ...] [ WITH ROLLUP | WITH CUBE ]

GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } ( grouping_set [, ...] ) } [, ...]

grouping_set
   { expression |
     ( [ expression [, ...] ] ) }

而聚合函数被定义为

aggregate_name ( [ DISTINCT ] expression [, ...] ) [ FILTER ( WHERE boolean_expression ) ]

参数

  • group_expression

    指定基于其组合在一起的行的条件。 基于分组表达式的结果值执行行的分组。 分组表达式可以是列名称(如) GROUP BY a, a 、列位置(如 GROUP BY 0 )或表达式(如) GROUP BY a + b

  • grouping_set

    分组集由括号中的零个或多个逗号分隔的表达式指定。 如果分组集只有一个元素,则可以省略括号。 例如,将集分组 ( () , (b) ) 与 (a、b) 的分组集相同。

  • GROUPING SETS

    对之后指定的每个分组集的行进行分组 GROUPING SETS 。 例如:

    GROUP BY GROUPING SETS ((warehouse), (product)) 在语义上等效于和的结果联合 GROUP BY warehouseGROUP BY product

    此子句是的简写形式, UNION ALL 运算符的每个阶段对 UNION ALL 子句中指定的每个分组集执行聚合 GROUPING SETS

    同样, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) 在语义上等效于的结果与 GROUP BY warehouse, productGROUP BY product 和全局聚合的并集。

注意

对于 Hive 兼容 Databricks Runtime 允许 GROUP BY ... GROUPING SETS (...)GROUP BY通常会忽略表达式,但如果它们包含表达式以外的其他表达式,则 GROUPING SETS 额外的表达式将包含在分组表达式中,并且值始终为 null。 例如, SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b) 列 c 的输出始终为 null。

  • ROLLUP

    在一个语句中指定多个级别的聚合。 此子句用于基于多个分组集计算聚合。 ROLLUPGROUPING SETS 的速记。 例如:

    GROUP BY warehouse, product WITH ROLLUPGROUP BY ROLLUP(warehouse, product) 等效于

    GROUP BY GROUPING SETS((warehouse, product), (warehouse), ()).

    GROUP BY ROLLUP(warehouse, product, (warehouse, location))

    相当于 GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())

    规范的 N 个元素将 ROLLUP 导致 n + 1 GROUPING SETS

  • CUBE

    CUBE子句用于基于子句中指定的分组列的组合来执行聚合 GROUP BYCUBEGROUPING SETS 的速记。 例如:

    GROUP BY warehouse, product WITH CUBEGROUP BY CUBE(warehouse, product) 等效于

    GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()).

    GROUP BY CUBE(warehouse, product, (warehouse, location))

    相当于 GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())

    CUBE 规范的 N 个元素将得到 2^N 个 GROUPING SETS

  • aggregate_name

    聚合函数名称(MIN、MAX、COUNT、SUM、AVG 等)。

  • DISTINCT

    在将输入行传递给聚合函数之前,将其中的重复项删除。

  • FILTER

    筛选 WHERE 子句中 boolean_expression 计算结果为 true 的输入行,并将其传递给聚合函数;将放弃其他行。

# # #m 混合/嵌套分组分析

GROUP BY子句可以包括多个 group_expressions 和多个 CUBEROLLUPGROUPING SETS

GROUPING SETS 还可以具有嵌套 CUBE 的、 ROLLUPGROUPING SETS 子句。 例如:

GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location)), GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))

CUBEROLLUP 只是的语法 GROUPING SETS 。 请参阅上述部分,了解如何转换和转换 CUBEROLLUPGROUPING SETSgroup_expression 可以 GROUPING SETS 在此上下文中视为单个组。

对于子句中的多个 GROUPING SETSGROUP BY ,Databricks Runtime GROUPING SETS 通过执行原始的叉积生成单个 GROUPING SETS

对于嵌套 GROUPING SETS 在子句中的 GROUPING SETS ,Databricks Runtime 只是采用其分组集并将其拖放。 例如:

GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())

GROUP BY warehouse, ROLLUP(product), CUBE(location, size)

等效于 GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))

GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))

相当于 GROUP BY GROUPING SETS((warehouse), (warehouse, product))

示例

CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
    (100, 'Fremont', 'Honda Civic', 10),
    (100, 'Fremont', 'Honda Accord', 15),
    (100, 'Fremont', 'Honda CRV', 7),
    (200, 'Dublin', 'Honda Civic', 20),
    (200, 'Dublin', 'Honda Accord', 10),
    (200, 'Dublin', 'Honda CRV', 3),
    (300, 'San Jose', 'Honda Civic', 5),
    (300, 'San Jose', 'Honda Accord', 8);

-- Sum of quantity per dealership. Group by `id`.
> SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
  id sum(quantity)
 --- -------------
 100            32
 200            33
 300            13

-- Use column position in GROUP by clause.
> SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
  id sum(quantity)
 --- -------------
 100            32
 200            33
 300            13

-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
> SELECT id, sum(quantity) AS sum, max(quantity) AS max
    FROM dealer GROUP BY id ORDER BY id;
  id sum max
 --- --- ---
 100  32  15
 200  33  20
 300  13   8

-- Count the number of distinct dealers in cities per car_model.
> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
    car_model count
 ------------ -----
  Honda Civic     3
    Honda CRV     2
 Honda Accord     3

-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
> SELECT id,
         sum(quantity) FILTER (WHERE car_model IN ('Honda Civic', 'Honda CRV')) AS `sum(quantity)`
    FROM dealer
    GROUP BY id ORDER BY id;
  id sum(quantity)
 --- -------------
 100            17
 200            23
 300             5

-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
> SELECT city, car_model, sum(quantity) AS sum
    FROM dealer
    GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
    ORDER BY city;
      city    car_model sum
 --------- ------------ ---
      null         null  78
      null  HondaAccord  33
      null     HondaCRV  10
      null   HondaCivic  35
    Dublin         null  33
    Dublin  HondaAccord  10
    Dublin     HondaCRV   3
    Dublin   HondaCivic  20
   Fremont         null  32
   Fremont  HondaAccord  15
   Fremont     HondaCRV   7
   Fremont   HondaCivic  10
  San Jose         null  13
  San Jose  HondaAccord   8
  San Jose   HondaCivic   5

-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
> SELECT city, car_model, sum(quantity) AS sum
    FROM dealer
    GROUP BY city, car_model WITH ROLLUP
    ORDER BY city, car_model;
      city    car_model sum
 --------- ------------ ---
      null         null  78
    Dublin         null  33
    Dublin  HondaAccord  10
    Dublin     HondaCRV   3
    Dublin   HondaCivic  20
   Fremont         null  32
   Fremont  HondaAccord  15
   Fremont     HondaCRV   7
   Fremont   HondaCivic  10
  San Jose         null  13
  San Jose  HondaAccord   8
  San Jose   HondaCivic   5

-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
> SELECT city, car_model, sum(quantity) AS sum
    FROM dealer
    GROUP BY city, car_model WITH CUBE
    ORDER BY city, car_model;
      city    car_model sum
 --------- ------------ ---
      null         null  78
      null  HondaAccord  33
      null     HondaCRV  10
      null   HondaCivic  35
    Dublin         null  33
    Dublin  HondaAccord  10
    Dublin     HondaCRV   3
    Dublin   HondaCivic  20
   Fremont         null  32
   Fremont  HondaAccord  15
   Fremont     HondaCRV   7
   Fremont   HondaCivic  10
  San Jose         null  13
  San Jose  HondaAccord   8
  San Jose   HondaCivic   5

--Prepare data for ignore nulls example
> CREATE TABLE person (id INT, name STRING, age INT);
> INSERT INTO person VALUES
    (100, 'Mary', NULL),
    (200, 'John', 30),
    (300, 'Mike', 80),
    (400, 'Dan' , 50);

--Select the first row in column age
> SELECT FIRST(age) FROM person;
  first(age, false)
 --------------------
  NULL

--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
> SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
  first(age, true)    last(id, false)    sum(id)
 ------------------- ------------------ ----------
  30                  400                1000