Insert

Insert from select queries

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] select_statement

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

part_spec:
  : (part_col_name1=val1 [, part_col_name2=val2, ...])

Insert data into a table or a partition from the result table of a select statement. Data is inserted by ordinal (ordering of columns) and not by names.

Note

(Delta Lake on Azure Databricks) If a column has a NOT NULL constraint, and an INSERT INTO statement sets a column value to NULL, a SparkException is thrown.
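For instance, a minimal sketch of this failure mode (the table name `delta_tab1` is illustrative):

```sql
-- Delta Lake on Azure Databricks: col1 rejects NULL values
CREATE TABLE delta_tab1 (col1 INT NOT NULL, col2 INT) USING DELTA;

-- Throws a SparkException because col1 is set to NULL
INSERT INTO delta_tab1 VALUES (NULL, 1);
```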

OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.

Examples

-- Creates a partitioned native parquet table
CREATE TABLE data_source_tab1 (col1 INT, p1 INT, p2 INT)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(1, 3)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE default.data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(3, 5)

Insert values into tables

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

values_row:
    : (val1 [, val2, ...])

Insert data into a table or a partition from a row value list.

OVERWRITE

Overwrite existing data in the table or the partition. Otherwise, new data is appended.

Examples

-- Creates a partitioned hive serde table (using the HiveQL syntax)
CREATE TABLE hive_serde_tab1 (col1 INT, p1 INT, p2 INT)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (1), (2)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (3), (4)

Dynamic partition inserts

In part_spec, the partition column values are optional. When the partition specification part_spec is not completely provided, such inserts are called dynamic partition inserts or multi-partition inserts. When the values are not specified, these columns are referred to as dynamic partition columns; otherwise, they are static partition columns. For example, the partition spec (p1 = 3, p2, p3) has a static partition column (p1) and two dynamic partition columns (p2 and p3).

In part_spec, static partition keys must come before the dynamic partition keys. This means all partition columns having constant values must appear before other partition columns that do not have an assigned constant value.

The partition values of dynamic partition columns are determined during execution. The dynamic partition columns must be specified last in both part_spec and the input result set (of the row value list or the select query). They are resolved by position, not by name, so the ordering must match exactly.

The DataFrameWriter APIs do not have an interface to specify partition values. Therefore, the insertInto() API always uses dynamic partition mode.
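A hedged PySpark sketch of this behavior (it assumes a running SparkSession `spark` and the `data_source_tab1` table from the earlier example; since insertInto() has no PARTITION clause, partition columns are passed as ordinary columns and matched by position):

```python
# Every partition column must be supplied as a regular column,
# in the table's column order (col1, p1, p2).
df = spark.range(1, 3).selectExpr("id AS col1", "3 AS p1", "4 AS p2")

# Appends into partition (p1 = 3, p2 = 4), resolved dynamically.
df.write.insertInto("data_source_tab1")

# Overwrites the matching partitions instead of appending.
df.write.insertInto("data_source_tab1", overwrite=True)
```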

Important

In dynamic partition mode, the input result set could produce a large number of dynamic partitions, and thus generate a large number of partition directories.

OVERWRITE

The semantics differ based on the type of the target table.

  • Hive SerDe tables: INSERT OVERWRITE does not delete partitions ahead of time; it only overwrites those partitions that have data written into them at runtime. This matches Apache Hive semantics. For Hive SerDe tables, Spark SQL respects the Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode.
  • Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (for example, PARTITION(a=1, b)) and then inserts all the remaining values. The behavior of native data source tables can be made consistent with Hive SerDe tables by changing the session-specific configuration spark.sql.sources.partitionOverwriteMode to DYNAMIC. The default mode is STATIC.
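For example, the following session-scoped setting switches native data source tables to the Hive-consistent overwrite behavior described above:

```sql
SET spark.sql.sources.partitionOverwriteMode = DYNAMIC;
```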

Examples

-- Create a partitioned native Parquet table
CREATE TABLE data_source_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Two partitions ('part1', 'part1') and ('part1', 'part2') are created by this dynamic insert.
-- The dynamic partition column p2 is resolved by the last column `'part' || id`
INSERT INTO data_source_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)

-- A new partition ('partNew1', 'partNew2') is added by this INSERT OVERWRITE.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, the two partitions ('part1', 'part1') and ('part1', 'part2') are dropped,
-- because both partitions are included by (p1 = 'part1', p2).
-- Then, two partitions ('partNew1', 'partNew2'), ('part1', 'part1') exist after this operation.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')

-- Create and fill a partitioned hive serde table with three partitions:
-- ('part1', 'part1'), ('part1', 'part2') and ('partNew1', 'partNew2')
CREATE TABLE hive_serde_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)
INSERT INTO hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, only the partition ('part1', 'part1') is overwritten by the new value.
-- All the three partitions still exist.
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')

Insert values into directory

INSERT OVERWRITE [LOCAL] DIRECTORY [directory_path]
  USING data_source [OPTIONS (key1=val1, key2=val2, ...)]
  [AS] SELECT ... FROM ...

Insert the query results of select_statement into a directory directory_path using the Spark native format. If the specified path exists, it is replaced with the output of select_statement.

DIRECTORY

The path of the destination directory of the insert. The directory can also be specified in OPTIONS using the key path. If the specified path exists, it is replaced with the output of select_statement. If LOCAL is used, the directory is on the local file system.

USING

The file format to use for the insert. One of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, and LIBSVM, or a fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.

AS

Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination/path')
SELECT key, col1, col2 FROM source_table

INSERT OVERWRITE DIRECTORY '/tmp/destination/path'
USING json
SELECT 1 as a, 'c' as b

Insert values into directory with Hive format

INSERT OVERWRITE [LOCAL] DIRECTORY directory_path
  [ROW FORMAT row_format] [STORED AS file_format]
  [AS] select_statement

Insert the query results of select_statement into a directory directory_path using Hive SerDe. If the specified path exists, it is replaced with the output of select_statement.

Note

This command is supported only when Hive support is enabled.

DIRECTORY

The path of the destination directory of the insert. If the specified path exists, it is replaced with the output of select_statement. If LOCAL is used, the directory is on the local file system.

ROW FORMAT

Use the SERDE clause to specify a custom SerDe for this insert. Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.
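For instance, a sketch of the DELIMITED form (the comma delimiter, the 'NULL' marker, and the destination path are illustrative choices):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/delimited'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' NULL DEFINED AS 'NULL'
STORED AS TEXTFILE
SELECT * FROM source_table
```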

STORED AS

The file format for this insert. One of TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. Alternatively, you can specify your own input and output format through INPUTFORMAT and OUTPUTFORMAT. Only TEXTFILE, SEQUENCEFILE, and RCFILE can be used with ROW FORMAT SERDE, and only TEXTFILE can be used with ROW FORMAT DELIMITED.

AS

Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/path'
STORED AS orc
SELECT * FROM source_table WHERE key < 10