Insert

Insert from select queries

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] select_statement

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

part_spec:
  : (part_col_name1=val1 [, part_col_name2=val2, ...])

Insert data into a table or a partition from the result table of a select statement. Data is inserted by ordinal (ordering of columns), not by name.
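Because columns are matched by position rather than by name, an alias in the select list does not redirect a value to the like-named column. A minimal sketch (the tables `source_tab` and `target_tab` are hypothetical, introduced only for illustration):

```sql
-- Hypothetical tables for illustration.
CREATE TABLE source_tab (a INT, b INT) USING PARQUET;
CREATE TABLE target_tab (col1 INT, col2 INT) USING PARQUET;

-- The first select column always feeds col1 and the second feeds col2,
-- even though the aliases suggest the opposite mapping.
INSERT INTO target_tab SELECT b AS col2, a AS col1 FROM source_tab;
```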

OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.

Note

In Databricks Runtime 7.0 and above, INSERT supports schema enforcement and evolution with Delta Lake. If a column's data type cannot be safely cast to the Delta table's data type, a runtime exception is thrown. If schema evolution is enabled, new columns can exist as the last columns of your schema (or nested columns) for the schema to evolve.
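As a sketch of schema evolution with an INSERT, assuming a hypothetical Delta table `events`, a hypothetical source table `staged_events`, and Delta Lake's session-level auto-merge configuration:

```sql
-- Hypothetical Delta table with two columns.
CREATE TABLE events (id INT, name STRING) USING DELTA;

-- Enable automatic schema evolution for this session.
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- The new column `ts` does not exist in `events`; with schema evolution
-- enabled, it is appended as the last column of the schema.
INSERT INTO events SELECT id, name, current_timestamp() AS ts FROM staged_events;
```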

Examples

-- Creates a partitioned native parquet table
CREATE TABLE data_source_tab1 (col1 INT, p1 INT, p2 INT)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(1, 3)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE default.data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(3, 5)

Insert values into tables

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

values_row:
    : (val1 [, val2, ...])

Insert data into a table or a partition from a row value list.

OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.

Examples

-- Creates a partitioned hive serde table (using the HiveQL syntax)
CREATE TABLE hive_serde_tab1 (col1 INT, p1 INT, p2 INT)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (1), (2)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (3), (4)

Dynamic partition inserts

When the partition specification part_spec is not completely provided, such inserts are called dynamic partition inserts, also known as multi-partition inserts. In part_spec, the partition column values are optional. When a value is not given, the column is referred to as a dynamic partition column; otherwise, it is a static partition column. For example, the partition spec (p1 = 3, p2, p3) has a static partition column (p1) and two dynamic partition columns (p2 and p3).

In part_spec, the static partition keys must come before the dynamic partition keys. That means all partition columns having constant values need to appear before the other partition columns that do not have an assigned constant value.

The partition values of dynamic partition columns are determined during execution. The dynamic partition columns must be specified last, both in part_spec and in the input result set (of the row value list or the select query). They are resolved by position, not by name, so the orders must match exactly.

Currently, the DataFrameWriter APIs do not have an interface to specify partition values. Therefore, the insertInto() API always uses dynamic partition mode.

Important

In dynamic partition mode, the input result set could produce a large number of dynamic partitions, and thus generate a large number of partition directories.

OVERWRITE

The semantics differ based on the type of the target table.

  • Hive SerDe tables: INSERT OVERWRITE doesn't delete partitions ahead of time; it only overwrites those partitions that have data written into them at runtime. This matches Apache Hive semantics. For Hive SerDe tables, Spark SQL respects the Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode.
  • Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (for example, PARTITION(a=1, b)) and then inserts all the remaining values. The behavior of native data source tables can be changed to be consistent with Hive SerDe tables by setting the session-specific configuration spark.sql.sources.partitionOverwriteMode to DYNAMIC. The default mode is STATIC.
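The mode switch is per session. As a sketch, assuming a partitioned native table shaped like data_source_tab2 from the examples below:

```sql
-- Switch native data source tables to Hive-style dynamic overwrite
-- for this session only; other sessions keep the STATIC default.
SET spark.sql.sources.partitionOverwriteMode = DYNAMIC;

-- With DYNAMIC mode, only the partitions that receive new rows are
-- replaced; other partitions matching (p1 = 'part1', p2) are left intact.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1');
```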

Examples

-- Create a partitioned native Parquet table
CREATE TABLE data_source_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Two partitions ('part1', 'part1') and ('part1', 'part2') are created by this dynamic insert.
-- The dynamic partition column p2 is resolved by the last column `'part' || id`
INSERT INTO data_source_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)

-- A new partition ('partNew1', 'partNew2') is added by this INSERT OVERWRITE.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, the two partitions ('part1', 'part1') and ('part1', 'part2') are dropped,
-- because both partitions are included by (p1 = 'part1', p2).
-- Then, two partitions ('partNew1', 'partNew2'), ('part1', 'part1') exist after this operation.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')

-- Create and fill a partitioned hive serde table with three partitions:
-- ('part1', 'part1'), ('part1', 'part2') and ('partNew1', 'partNew2')
CREATE TABLE hive_serde_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)
INSERT INTO hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, only the partition ('part1', 'part1') is overwritten by the new value.
-- All the three partitions still exist.
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')

Insert values into directory

INSERT OVERWRITE [LOCAL] DIRECTORY [directory_path]
  USING data_source [OPTIONS (key1=val1, key2=val2, ...)]
  [AS] SELECT ... FROM ...

Insert the query results of select_statement into a directory directory_path using Spark native format. If the specified path exists, it is replaced with the output of select_statement.

DIRECTORY

The path of the destination directory of the insert. The directory can also be specified in OPTIONS using the key path. If the specified path exists, it is replaced with the output of select_statement. If LOCAL is used, the directory is on the local file system.

USING

The file format to use for the insert. One of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, and LIBSVM, or a fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.

AS

Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination/path')
SELECT key, col1, col2 FROM source_table

INSERT OVERWRITE DIRECTORY '/tmp/destination/path'
USING json
SELECT 1 as a, 'c' as b

Insert values into directory with Hive format

INSERT OVERWRITE [LOCAL] DIRECTORY directory_path
  [ROW FORMAT row_format] [STORED AS file_format]
  [AS] select_statement

Insert the query results of select_statement into a directory directory_path using Hive SerDe. If the specified path exists, it is replaced with the output of select_statement.

Note

This command is supported only when Hive support is enabled.

DIRECTORY

The path of the destination directory of the insert. If the specified path exists, it is replaced with the output of select_statement. If LOCAL is used, the directory is on the local file system.

ROW FORMAT

Use the SERDE clause to specify a custom SerDe for this insert. Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.

STORED AS

The file format for this insert. One of TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. Alternatively, you can specify your own input and output format through INPUTFORMAT and OUTPUTFORMAT. Only TEXTFILE, SEQUENCEFILE, and RCFILE can be used with ROW FORMAT SERDE, and only TEXTFILE can be used with ROW FORMAT DELIMITED.

AS

Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/path'
STORED AS orc
SELECT * FROM source_table WHERE key < 10
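The ROW FORMAT DELIMITED clause can be combined with TEXTFILE output, per the restriction described above. A sketch (the destination path and delimiter are chosen for illustration):

```sql
-- Write comma-delimited text files; DELIMITED may only be combined
-- with the TEXTFILE format.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/csv_path'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  SELECT * FROM source_table WHERE key < 10
```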