Optimize conversion between PySpark and pandas DataFrames

Apache Arrow is an in-memory columnar data format used in Apache Spark to efficiently transfer data between JVM and Python processes. This is beneficial to Python developers who work with pandas and NumPy data. However, its usage is not automatic and requires some minor changes to configuration or code to take full advantage of it and ensure compatibility.

PyArrow versions

PyArrow is installed in Databricks Runtime. For information on the version of PyArrow available in each Databricks Runtime version, see the Databricks runtime release notes.
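
For example, you can confirm the bundled version from a notebook cell; this is a quick sanity check and is not required for Arrow to work:

import pyarrow

# Print the PyArrow version bundled with the current runtime
print(pyarrow.__version__)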

Supported SQL types

All Spark SQL data types are supported by Arrow-based conversion except MapType, ArrayType of TimestampType, and nested StructType. StructType is represented as a pandas.DataFrame instead of a pandas.Series. BinaryType is supported only when PyArrow is 0.10.0 or higher.

Convert PySpark DataFrames to and from pandas DataFrames

Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.enabled to true. This configuration is disabled by default.

In addition, optimizations enabled by spark.sql.execution.arrow.enabled could fall back to a non-Arrow implementation if an error occurs before the computation within Spark. You can control this behavior using the Spark configuration spark.sql.execution.arrow.fallback.enabled.
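
For example, if you would rather have conversion errors surface immediately than silently take the slower non-Arrow path, you could disable the fallback explicitly; this is a sketch of one possible configuration, as the fallback is enabled by default:

# Enable Arrow-based conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Disable the automatic fallback so unsupported types raise an error
# instead of silently using the non-Arrow code path
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")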

Example

import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

Using the Arrow optimizations produces the same results as when Arrow is not enabled. Even with Arrow, toPandas() results in the collection of all records in the DataFrame to the driver program, so it should be done only on a small subset of the data.
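
If you only need to inspect a sample on the driver, one common pattern is to limit the DataFrame before converting; the threshold of 1000 rows below is an arbitrary illustration:

# Collect only a bounded number of rows to the driver
sample_pdf = df.limit(1000).toPandas()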

In addition, not all Spark data types are supported, and an error can be raised if a column has an unsupported type. If an error occurs during createDataFrame(), Spark falls back to creating the DataFrame without Arrow.
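
As a minimal sketch of this fallback, the following creates a DataFrame with a MapType column, which Arrow-based conversion does not support (assuming a runtime where MapType is still unsupported by Arrow). With the fallback enabled, the call succeeds without Arrow and Spark logs a warning; with the fallback disabled, it raises an error instead:

import pandas as pd
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

# MapType is not supported by Arrow-based conversion
schema = StructType([StructField("m", MapType(StringType(), IntegerType()))])
pdf_map = pd.DataFrame({"m": [{"a": 1}, {"b": 2}]})

# With spark.sql.execution.arrow.fallback.enabled set to true (the default),
# Spark builds this DataFrame without Arrow instead of failing
df_map = spark.createDataFrame(pdf_map, schema=schema)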