Prepare data for distributed training

03/01/2024

This article describes two ways to prepare data for distributed training: Petastorm and TFRecord.

Petastorm (Recommended)

Petastorm is an open source data access library that lets you load data stored in Apache Parquet format directly. This is convenient for Azure Databricks and Apache Spark users because Parquet is the recommended data format. The following article illustrates this use case:

Load data using Petastorm

TFRecord

You can also use TFRecord format as the data source for distributed deep learning. TFRecord format is a simple record-oriented binary format that many TensorFlow applications use for training data.
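To make "record-oriented binary format" concrete, the following is a minimal pure-Python sketch of the on-disk framing described in the TensorFlow TFRecord guide: each record is a little-endian 8-byte length, a 4-byte masked CRC-32C of that length, the payload bytes, and a 4-byte masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc32c`, `write_record`, `read_records`) are illustrative and not part of any library. In practice you would use TensorFlow's own writer and reader rather than this code.

```python
import io
import struct

# Build the lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC_TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    """Compute the CRC-32C checksum of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Apply the TFRecord mask: rotate right by 15 bits, then add a constant."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload: bytes) -> None:
    """Append one framed record to a binary stream."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc32c(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc32c(payload)))


def read_records(stream):
    """Yield each payload from a binary stream, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc32c(header):
            raise ValueError("corrupted record length")
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if payload_crc != masked_crc32c(payload):
            raise ValueError("corrupted record payload")
        yield payload


# Round-trip three payloads through an in-memory buffer.
buffer = io.BytesIO()
for item in (b"first", b"second record", b""):
    write_record(buffer, item)
buffer.seek(0)
records = list(read_records(buffer))
```

Each record carries 16 bytes of framing on top of its payload. The payload itself is opaque to the format; TensorFlow applications typically store serialized `tf.train.Example` messages in it.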

tf.data.TFRecordDataset is a TensorFlow dataset made up of records from TFRecord files. For more details about how to consume TFRecord data, see the TensorFlow guide Consuming TFRecord data.
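As a rough illustration of that workflow, the sketch below writes a few `tf.train.Example` records to a temporary file and reads them back with `tf.data.TFRecordDataset`. The feature names `x` and `y`, the file name, and the batch size are assumptions chosen for the example; a real training job would use its own schema and typically shard the data across many files.

```python
import os
import tempfile

import tensorflow as tf


def make_example(feature_value: float, label: int) -> bytes:
    """Serialize one (feature, label) pair as a tf.train.Example."""
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "x": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[feature_value])
                ),
                "y": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])
                ),
            }
        )
    )
    return example.SerializeToString()


# Write four records to a temporary TFRecord file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(4):
        writer.write(make_example(float(i), i % 2))

# The parsing schema must match the features that were written.
feature_spec = {
    "x": tf.io.FixedLenFeature([], tf.float32),
    "y": tf.io.FixedLenFeature([], tf.int64),
}


def parse(record):
    """Decode one serialized Example into a dict of tensors."""
    return tf.io.parse_single_example(record, feature_spec)


# Read the file back, parse each record, and group records into batches of two.
dataset = tf.data.TFRecordDataset([path]).map(parse).batch(2)
batches = list(dataset.as_numpy_iterator())
```

`TFRecordDataset` accepts a list of file paths, so the same pipeline extends to sharded data by passing more files.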

The following articles describe and illustrate the recommended ways to save your data to TFRecord files and load TFRecord files:

Save Apache Spark DataFrames as TFRecord files
