Databricks Runtime 5.3 ML (Unsupported)

Databricks released this image in April 2019.

Databricks Runtime 5.3 ML provides a ready-to-go environment for machine learning and data science based on Databricks Runtime 5.3 (Unsupported). Databricks Runtime for ML contains many popular machine learning libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also supports distributed deep learning training using Horovod.

For more information, including instructions for creating a Databricks Runtime ML cluster, see Databricks Runtime for Machine Learning.

New features

Databricks Runtime 5.3 ML is built on top of Databricks Runtime 5.3. For information on what’s new in Databricks Runtime 5.3, see the Databricks Runtime 5.3 (Unsupported) release notes. In addition to library updates, Databricks Runtime 5.3 ML introduces the following new features:

  • MLflow + Apache Spark MLlib integration: Databricks Runtime 5.3 ML supports automatic logging of MLflow runs for models fit using the PySpark tuning algorithms CrossValidator and TrainValidationSplit.

    Important

    This feature is in Private Preview. Contact your Azure Databricks sales representative to learn about enabling it.

  • Upgrades the following libraries to the latest version:

    • PyArrow from 0.8.0 to 0.12.1: BinaryType is supported by Arrow-based conversion and can be used in PandasUDF.
    • Horovod from 0.15.2 to 0.16.0.
    • TensorboardX from 1.4 to 1.6.
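
As a minimal sketch of the kind of tuning job the MLflow integration above applies to, the following builds a CrossValidator over a small parameter grid using the standard PySpark tuning API. The estimator, the grid values, the column names, and the helper names build_cross_validator and count_model_fits are assumptions made for illustration. The sketch does not show how the Private Preview logging is enabled, and it contains no MLflow calls because the release note describes the logging as automatic.

```python
# Hypothetical grid values, chosen only for illustration.
REG_PARAMS = [0.01, 0.1, 1.0]
ELASTIC_NET_PARAMS = [0.0, 0.5]
NUM_FOLDS = 3


def count_model_fits(grid_size, num_folds):
    """Fits performed by CrossValidator: one per (fold, param map) pair,
    plus a final refit of the best param map on the full dataset."""
    return grid_size * num_folds + 1


def build_cross_validator():
    """Build a CrossValidator for a logistic regression estimator."""
    # Imported lazily so count_model_fits works without a Spark installation.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Assumes the training DataFrame has "features" and "label" columns.
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, REG_PARAMS)
            .addGrid(lr.elasticNetParam, ELASTIC_NET_PARAMS)
            .build())
    return CrossValidator(estimator=lr,
                          estimatorParamMaps=grid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=NUM_FOLDS)
```

On a cluster with an active Spark session, build_cross_validator().fit(train_df) would run the search, where train_df is a hypothetical DataFrame. With the grid above (6 parameter combinations, 3 folds), count_model_fits(6, 3) gives 19 model fits.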

The Databricks ML Model Export API has been deprecated. Azure Databricks recommends using MLeap instead, which provides broader coverage of MLlib model types. Find out more at MLeap ML Model Export.

Note

In addition, Databricks Runtime 5.3 contains a new FUSE mount, file:/dbfs/ml, a shared storage location optimized for data loading, model checkpointing, and logging from each worker. It provides high-performance I/O for deep learning workloads. See Prepare Storage for Data Loading and Model Checkpointing.
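
Because a FUSE mount is accessed through ordinary file paths, checkpointing to it can be sketched with plain Python file I/O. In the sketch below, only the /dbfs/ml root comes from the note above; the my_experiment subdirectory, the helper names, and the use of pickle as the checkpoint format are assumptions for illustration.

```python
import os
import pickle

# Root of the FUSE mount named in the note; the subdirectory is hypothetical.
CHECKPOINT_ROOT = "/dbfs/ml/my_experiment/checkpoints"


def save_checkpoint(state, step, root=CHECKPOINT_ROOT):
    """Write `state` to <root>/step-<step>.pkl and return the file path."""
    if not os.path.isdir(root):
        os.makedirs(root)
    # Zero-padded step numbers keep lexical order equal to numeric order
    # for steps below one million.
    path = os.path.join(root, "step-{:06d}.pkl".format(step))
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_latest_checkpoint(root=CHECKPOINT_ROOT):
    """Return the state from the highest-numbered checkpoint, or None."""
    if not os.path.isdir(root):
        return None
    names = sorted(n for n in os.listdir(root)
                   if n.startswith("step-") and n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(root, names[-1]), "rb") as f:
        return pickle.load(f)
```

A training loop could call save_checkpoint periodically and call load_latest_checkpoint at startup to resume. A deep learning framework's own checkpoint writer pointed at the same directory would serve the same purpose.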

Maintenance updates

See Databricks Runtime 5.3 ML maintenance updates.

System environment

The system environment in Databricks Runtime 5.3 ML differs from Databricks Runtime 5.3 as follows:

  • Python: 2.7.15 for Python 2 clusters and 3.6.5 for Python 3 clusters.
  • DBUtils: Databricks Runtime 5.3 ML does not contain Library utilities.
  • For GPU clusters, the following NVIDIA GPU libraries:
    • Tesla driver 396.44
    • CUDA 9.2
    • CUDNN 7.2.1
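
A notebook can confirm which of the two interpreters it is running on. The sketch below compares sys.version_info against the Python versions listed above; the helper name is_listed_python is hypothetical.

```python
import sys

# Interpreter versions listed for Databricks Runtime 5.3 ML.
LISTED_PYTHON_VERSIONS = {(2, 7, 15), (3, 6, 5)}


def is_listed_python(version_info=None):
    """Return True if (major, minor, micro) matches a listed interpreter."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) in LISTED_PYTHON_VERSIONS


print("Python {}.{}.{} listed: {}".format(
    sys.version_info[0], sys.version_info[1], sys.version_info[2],
    is_listed_python()))
```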

Libraries

The following sections list the libraries included in Databricks Runtime 5.3 ML that differ from those included in Databricks Runtime 5.3.

Top-tier libraries

Databricks Runtime 5.3 ML includes the following top-tier libraries:

Python libraries

Databricks Runtime 5.3 ML uses Conda for Python package management. As a result, the pre-installed Python libraries differ substantially from those in Databricks Runtime. The following is a full list of the Python packages and versions installed using the Conda package manager.

Library            Version                    Library             Version   Library                        Version
absl-py            0.7.0                      argparse            1.4.0     asn1crypto                     0.24.0
astor              0.7.1                      backports-abc       0.5       backports.functools-lru-cache  1.5
backports.weakref  1.0.post1                  bcrypt              3.1.6     bleach                         2.1.3
boto               2.48.0                     boto3               1.7.62    botocore                       1.10.62
certifi            2018.04.16                 cffi                1.11.5    chardet                        3.0.4
cloudpickle        0.5.3                      colorama            0.3.9     configparser                   3.5.0
cryptography       2.2.2                      cycler              0.10.0    Cython                         0.28.2
decorator          4.3.0                      docutils            0.14      entrypoints                    0.2.3
enum34             1.1.6                      et-xmlfile          1.0.1     funcsigs                       1.0.2
functools32        3.2.3-2                    fusepy              2.0.4     futures                        3.2.0
gast               0.2.2                      grpcio              1.12.1    h5py                           2.8.0
horovod            0.16.0                     html5lib            1.0.1     idna                           2.6
ipaddress          1.0.22                     ipython             5.7.0     ipython_genutils               0.2.0
jdcal              1.4                        Jinja2              2.10      jmespath                       0.9.3
jsonschema         2.6.0                      jupyter-client      5.2.3     jupyter-core                   4.4.0
Keras              2.2.4                      Keras-Applications  1.0.6     Keras-Preprocessing            1.0.5
kiwisolver         1.0.1                      linecache2          1.0.0     llvmlite                       0.23.1
lxml               4.2.1                      Markdown            3.0.1     MarkupSafe                     1.0
matplotlib         2.2.2                      mistune             0.8.3     mleap                          0.8.1
mock               2.0.0                      msgpack             0.5.6     nbconvert                      5.3.1
nbformat           4.4.0                      nose                1.3.7     nose-exclude                   0.5.0
numba              0.38.0+0.g2a2b772fc.dirty  numpy               1.14.3    olefile                        0.45.1
openpyxl           2.5.3                      pandas              0.23.0    pandocfilters                  1.4.2
paramiko           2.4.1                      pathlib2            2.3.2     patsy                          0.5.0
pbr                5.1.1                      pexpect             4.5.0     pickleshare                    0.7.4
Pillow             5.1.0                      pip                 10.0.1    ply                            3.11
prompt-toolkit     1.0.15                     protobuf            3.6.1     psutil                         5.6.0
psycopg2           2.7.5                      ptyprocess          0.5.2     pyarrow                        0.12.1
pyasn1             0.4.5                      pycparser           2.18      Pygments                       2.2.0
PyNaCl             1.3.0                      pyOpenSSL           18.0.0    pyparsing                      2.2.0
PySocks            1.6.8                      Python              2.7.15    python-dateutil                2.7.3
pytz               2018.4                     PyYAML              3.12      pyzmq                          17.0.0
requests           2.18.4                     s3transfer          0.1.13    scandir                        1.7
scikit-learn       0.19.1                     scipy               1.1.0     seaborn                        0.8.1
setuptools         39.1.0                     simplegeneric       0.8.1     singledispatch                 3.4.0.3
six                1.11.0                     statsmodels         0.9.0     subprocess32                   3.5.3
tensorboard        1.12.2                     tensorboardX        1.6       tensorflow                     1.12.0
termcolor          1.1.0                      testpath            0.3.1     torch                          0.4.1
torchvision        0.2.1                      tornado             5.0.2     traceback2                     1.4.0
traitlets          4.3.2                      unittest2           1.1.0     urllib3                        1.22
virtualenv         16.0.0                     wcwidth             0.1.7     webencodings                   0.5.1
Werkzeug           0.14.1                     wheel               0.31.1    wrapt                          1.10.11
wsgiref            0.1.2
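
To check a running cluster against the table above, the installed versions can be compared with the listed ones. The following is a sketch under stated assumptions: the helper names are hypothetical, only a few rows of the table are copied into LISTED, and the lookup falls back to pkg_resources (shipped with setuptools) on interpreters that lack importlib.metadata.

```python
# A few rows copied from the table above; extend as needed.
LISTED = {
    "numpy": "1.14.3",
    "pandas": "0.23.0",
    "pyarrow": "0.12.1",
    "tensorflow": "1.12.0",
    "torch": "0.4.1",
}


def installed_version(name):
    """Return the installed version string of `name`, or None if absent."""
    try:
        from importlib import metadata  # Available on Python 3.8 and later
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return None
    import pkg_resources  # Fallback for older interpreters
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None


def find_mismatches(listed, lookup=installed_version):
    """Map each package whose installed version differs from the listed
    version to a (listed, installed) pair; installed is None if missing."""
    mismatches = {}
    for name, want in sorted(listed.items()):
        have = lookup(name)
        if have != want:
            mismatches[name] = (want, have)
    return mismatches
```

Calling find_mismatches(LISTED) in a notebook returns an empty dictionary when every copied row matches the environment. The lookup parameter exists so the comparison can be exercised with a plain dictionary instead of the live environment.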

In addition, the following Spark packages include Python modules:

Spark Package        Python Module  Version
graphframes          graphframes    0.7.0-db1-spark2.4
spark-deep-learning  sparkdl        1.5.0-db1-spark2.4
tensorframes         tensorframes   0.6.0-s_2.11

R libraries

The R libraries are identical to the R libraries in Databricks Runtime 5.3.

Java and Scala libraries (Scala 2.11 cluster)

In addition to Java and Scala libraries in Databricks Runtime 5.3, Databricks Runtime 5.3 ML contains the following JARs:

Group ID           Artifact ID                      Version
com.databricks     spark-deep-learning              1.5.0-db1-spark2.4
com.typesafe.akka  akka-actor_2.11                  2.3.11
ml.combust.mleap   mleap-databricks-runtime_2.11    0.13.0
ml.dmlc            xgboost4j                        0.81
ml.dmlc            xgboost4j-spark                  0.81
org.graphframes    graphframes_2.11                 0.7.0-db1-spark2.4
org.tensorflow     libtensorflow                    1.12.0
org.tensorflow     libtensorflow_jni                1.12.0
org.tensorflow     spark-tensorflow-connector_2.11  1.12.0
org.tensorflow     tensorflow                       1.12.0
org.tensorframes   tensorframes                     0.6.0-s_2.11