This section covers loading data specifically for machine learning (ML) and deep learning (DL) applications. For general information about loading data, see Data.
Store files for data loading and model checkpointing
Machine learning applications may need to use shared storage for data loading and model checkpointing. This is particularly important for distributed deep learning. Azure Databricks provides the Databricks File System (DBFS) for accessing data on a cluster using both Spark and local file APIs.
- Databricks Runtime 6.3 ML (Unsupported) and above: Azure Databricks provides a high-performance FUSE mount.
- Databricks Runtime 5.5 LTS ML: Azure Databricks provides dbfs:/ml, a special folder that offers high-performance I/O for deep learning workloads, that maps to file:/dbfs/ml on driver and worker nodes. Azure Databricks recommends saving data under /dbfs/ml. This FUSE mount also alleviates the local file I/O API limitation in Databricks Runtime of supporting only files smaller than 2GB.
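Because the FUSE mount exposes DBFS through ordinary local file APIs, checkpointing code can use standard Python file I/O. The following is a minimal sketch, not an official Databricks API: the helper names save_checkpoint and load_latest_checkpoint, the ckpt-<step>.pkl naming scheme, and the use of pickle are all assumptions made for illustration.

```python
import os
import pickle


def save_checkpoint(state, checkpoint_dir, step):
    """Write `state` to <checkpoint_dir>/ckpt-<step>.pkl using local file APIs.

    Hypothetical helper: the file naming scheme is an assumption, not a
    Databricks convention.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Zero-pad the step so that lexical order matches numeric order
    # (valid for steps below 1,000,000).
    path = os.path.join(checkpoint_dir, f"ckpt-{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_latest_checkpoint(checkpoint_dir):
    """Return the state from the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(
        name
        for name in os.listdir(checkpoint_dir)
        if name.startswith("ckpt-") and name.endswith(".pkl")
    )
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1]), "rb") as f:
        return pickle.load(f)
```

On a cluster, checkpoint_dir would typically be a directory under the FUSE mount, for example save_checkpoint(state, "/dbfs/ml/my_experiment", step), where my_experiment is a hypothetical folder name. Files written there are on shared storage, so they should be visible to the driver and worker nodes.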
Load tabular data
You can load tabular machine learning data from tables or files (for example, see CSV file). You can convert Apache Spark DataFrames into pandas DataFrames using the PySpark toPandas method, and then optionally convert to NumPy format using the pandas to_numpy method.
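The conversion chain described above can be sketched as a small helper. This is an illustrative sketch under stated assumptions: the function name spark_to_numpy, its parameters, and the float32 default are hypothetical choices; only select, toPandas, and to_numpy are the actual PySpark and pandas methods.

```python
import numpy as np
import pandas as pd


def spark_to_numpy(spark_df, feature_cols, dtype=np.float32) -> np.ndarray:
    """Collect selected columns of a Spark DataFrame into a NumPy array.

    toPandas() brings every selected row into driver memory, so this is
    only appropriate for data that fits on a single node.
    """
    # Spark DataFrame -> pandas DataFrame (collected to the driver)
    pdf: pd.DataFrame = spark_df.select(*feature_cols).toPandas()
    # pandas DataFrame -> NumPy array with one row per record
    return pdf.to_numpy(dtype=dtype)
```

In a notebook, where a SparkSession is available as spark, a call might look like features = spark_to_numpy(spark.table("my_table"), ["col_a", "col_b"]), where the table and column names are placeholders. Selecting only the needed columns before calling toPandas keeps the amount of data collected to the driver as small as possible.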
Prepare data for distributed training
This section covers two methods for preparing data for distributed training: Petastorm and TFRecords.