Load data using Petastorm

Petastorm is an open source data access library. This library enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, see the Petastorm GitHub page and the Petastorm API documentation.

Load data from Spark DataFrames using Petastorm

The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a tf.data.Dataset or torch.utils.data.DataLoader. See the Spark Dataset Converter API section in the Petastorm API documentation.

The recommended workflow is:

  1. Use Apache Spark to load and optionally preprocess data.
  2. Use the Petastorm spark_dataset_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
  3. Feed the data into a DL framework for training or inference.
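The three steps above can be sketched as follows. This is a minimal sketch for a Databricks notebook, where `spark` is the predefined SparkSession; the dataset path, column names, and cache directory are hypothetical placeholders you should replace with your own:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Point the converter at a DBFS FUSE cache directory (hypothetical path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm_cache")

# 1. Load (and optionally preprocess) data with Spark.
df = spark.read.parquet("dbfs:/tmp/my_dataset").select("features", "label")

# 2. Convert the DataFrame; the converter materializes it as Parquet
#    in the cache directory configured above.
converter = make_spark_converter(df)

# 3. Feed the data into a DL framework -- here TensorFlow.
with converter.make_tf_dataset(batch_size=32) as dataset:
    # dataset is a tf.data.Dataset yielding named-tuple batches.
    for batch in dataset.take(1):
        print(batch.features.shape)

# Delete the cached Parquet copy when you are done with it.
converter.delete()
```

For PyTorch, replace `make_tf_dataset` with `converter.make_torch_dataloader(batch_size=32)`, which yields a torch.utils.data.DataLoader inside the same context-manager pattern.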

Configure cache directory

The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory. The cache directory must be a DBFS FUSE path starting with file:///dbfs/; for example, file:///dbfs/tmp/foo/ refers to the same location as dbfs:/tmp/foo/. You can configure the cache directory in two ways:

  • In the cluster Spark config, add the line: petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...

  • In your notebook, call spark.conf.set():

    from petastorm.spark import SparkDatasetConverter, make_spark_converter
    
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')
    

After using the cache, you can either delete it explicitly by calling converter.delete(), or manage it implicitly by configuring lifecycle rules in your object storage.

Databricks supports DL training in three scenarios:

  • Single-node training
  • Distributed hyperparameter tuning
  • Distributed training

For end-to-end examples, see the following notebooks:

Load Parquet files directly using Petastorm

This method is less preferred than the Petastorm Spark converter API.

The recommended workflow is:

  1. Use Apache Spark to load and optionally preprocess data.
  2. Save the data in Parquet format to a DBFS path that has a companion FUSE mount.
  3. Load the data in Petastorm format via the FUSE mount point.
  4. Use the data in a DL framework for training or inference.
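Steps 2–4 above can be sketched as follows. This is a minimal sketch for a Databricks notebook, where `spark` is the predefined SparkSession; the DBFS paths are hypothetical placeholders:

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# 2. Save the DataFrame as Parquet to a DBFS path (hypothetical path);
#    the same files are visible at /dbfs/tmp/parquet_data via the FUSE mount.
df = spark.read.parquet("dbfs:/tmp/my_dataset")
df.write.mode("overwrite").parquet("dbfs:/tmp/parquet_data")

# 3. Load the Parquet files with Petastorm through the FUSE path.
with make_batch_reader("file:///dbfs/tmp/parquet_data") as reader:
    # 4. Wrap the reader as a tf.data.Dataset and feed it to the DL framework.
    dataset = make_petastorm_dataset(reader)
    for batch in dataset.take(1):
        print(batch)
```

Note that make_batch_reader reads plain Parquet files; datasets written with Petastorm's own schema (Unischema) would instead be read with petastorm.make_reader.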

For an end-to-end example, see the example notebook.

Examples

Simplify data conversion from Spark to TensorFlow notebook

Get notebook

Simplify data conversion from Spark to PyTorch notebook

Get notebook

Use Spark and Petastorm to prepare data for deep learning notebook

Get notebook