TensorFlow

TensorFlow is an open-source machine learning framework created by Google. It supports deep learning and general numerical computation on CPUs, GPUs, and clusters of GPUs. It is subject to the terms and conditions of the Apache 2.0 License.

The following sections provide guidance on installing TensorFlow on Azure Databricks and give an example of running TensorFlow programs.

Note

This guide is not a comprehensive guide to TensorFlow. See the TensorFlow website.

TensorFlow versions included in Databricks Runtime ML

Databricks Runtime for Machine Learning includes TensorFlow and TensorBoard, so you can use these libraries without installing any packages. The included TensorFlow versions are:

Databricks Runtime ML version    TensorFlow version
7.3 - 7.4                        2.3.0
7.0 - 7.2                        2.2.0
6.3 - 6.6                        1.15.0
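
The version table can also be expressed as a small lookup, which may be handy when a notebook needs to branch on the preinstalled TensorFlow version. This is only a sketch: the function name `bundled_tensorflow_version` and the (major, minor) tuple encoding are illustrative and not part of any Databricks API.

```python
from typing import Optional, Tuple

# (lowest runtime, highest runtime, bundled TensorFlow), copied from the table.
_RUNTIME_ML_TO_TF = [
    ((7, 3), (7, 4), "2.3.0"),
    ((7, 0), (7, 2), "2.2.0"),
    ((6, 3), (6, 6), "1.15.0"),
]


def bundled_tensorflow_version(runtime: Tuple[int, int]) -> Optional[str]:
    """Return the TensorFlow version listed for a Databricks Runtime ML release.

    `runtime` is a (major, minor) tuple such as (7, 2). Returns None when the
    release is not covered by the table.
    """
    for low, high, tf_version in _RUNTIME_ML_TO_TF:
        if low <= runtime <= high:
            return tf_version
    return None


print(bundled_tensorflow_version((7, 2)))  # 2.2.0
```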

Install TensorFlow

This section provides instructions for installing or downgrading TensorFlow on Databricks Runtime for Machine Learning and Databricks Runtime, so that you can try out the latest features in TensorFlow. Due to package dependencies, there might be compatibility issues with other pre-installed packages. After installation, you can verify the installed version by running the following command in a Python notebook:

import tensorflow as tf
print([tf.__version__, tf.test.is_gpu_available()])
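
If importing TensorFlow is slow or fails after a reinstall, the package metadata can be inspected instead. The sketch below assumes Python 3.8 or later (for `importlib.metadata`); the helper names `installed_version` and `matches_pin` are hypothetical, and the pin check is a simplified prefix match rather than a full implementation of pip's version matching.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version: Optional[str], pin: str) -> bool:
    """Check a version against a wildcard pin such as '2.3.*' (prefix match only)."""
    if version is None:
        return False
    prefix = pin.rstrip("*").rstrip(".").split(".")
    return version.split(".")[: len(prefix)] == prefix


# Report which TensorFlow distributions are present and whether they match the pin.
for name in ("tensorflow", "tensorflow-cpu", "tensorflow-gpu"):
    version = installed_version(name)
    print(name, version, matches_pin(version, "2.3.*"))
```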

Install TensorFlow 2.3 on Databricks Runtime 7.2

Azure Databricks recommends installing TensorFlow using the %pip and %conda magic commands. In a notebook, run:

%pip install tensorflow-cpu==2.3.*

Install TensorFlow 1.15 on Databricks Runtime 7.2

In a notebook, run:

%pip install tensorflow-cpu==1.15.*

Install TensorFlow 2.3 on Databricks Runtime 7.2 ML

In a notebook, run:

CPU

%pip install tensorflow-cpu==2.3.*

GPU

%pip install tensorflow-gpu==2.3.*

Install TensorFlow 1.15 on Databricks Runtime 7.2 ML

In a notebook, run:

CPU

%pip install tensorflow-cpu==1.15.*

GPU

The official TensorFlow 1.15 release is built against CUDA 10.0, which is not compatible with the CUDA 10.1 installed in Databricks Runtime 7.0 ML and above. Azure Databricks provides a custom build of TensorFlow 1.15.3 that is compatible with CUDA 10.1. Use the following command to install it.

%pip install https://databricks-prod-cloudfront.cloud.databricks.com/artifacts/tensorflow/runtime-7.x/tensorflow-1.15.3-cp37-cp37m-linux_x86_64.whl
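
The wheel filename in this command encodes the interpreter it targets: under the standard wheel naming convention, the last three dash-separated fields are the Python tag, ABI tag, and platform tag, so `cp37` indicates CPython 3.7. The following sketch compares those tags with the running interpreter before installing; the helper names are hypothetical and the check assumes CPython.

```python
import sys
from typing import Tuple


def wheel_tags(filename: str) -> Tuple[str, ...]:
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    return tuple(stem.split("-")[-3:])


def current_cpython_tag() -> str:
    """Return the CPython tag of the running interpreter, for example 'cp37'."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


wheel = "tensorflow-1.15.3-cp37-cp37m-linux_x86_64.whl"
python_tag, abi_tag, platform_tag = wheel_tags(wheel)
print(python_tag, abi_tag, platform_tag)  # cp37 cp37m linux_x86_64
print("interpreter matches wheel:", python_tag == current_cpython_tag())
```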

Install TensorFlow 2.3 on Databricks Runtime 5.5 LTS ML

Init script for clusters on:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow-cpu==2.3.* setuptools==41.* grpcio==1.24.*

GPU

#!/bin/bash

set -e

apt-get remove -y --auto-remove cuda-toolkit-10-0
apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow==2.3.* setuptools==41.* grpcio==1.24.*

Install TensorFlow 2.3 on Databricks Runtime 5.5 LTS

Init script for clusters on:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==2.3.* setuptools==41.* pyasn1==0.4.6
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

GPU

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow==2.3.* setuptools==41.*
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

TensorFlow 2 known issues

TensorFlow 2 has a known incompatibility with Python pickling. You might encounter it if you use PySpark, HorovodRunner, Hyperopt, or any other package that depends on pickling. The workaround is to explicitly import TensorFlow modules inside your functions. Here is an example:

import tensorflow as tf

def bad_func(_):
  tf.keras.Sequential()

# You might see an error.
sc.parallelize(range(0)).foreach(bad_func)

def good_func(_):
  import tensorflow as tf
  tf.keras.Sequential()

# No error.
sc.parallelize(range(0)).foreach(good_func)

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML

Azure Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML using an init script.

Init script for clusters on:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-mkl=1.15 setuptools=41

GPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-gpu=1.15 setuptools=41

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS

Azure Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS using an init script.

Init script for clusters on:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==1.15.* setuptools==41.*

GPU

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends cuda-libraries-10-0 libcudnn7=7.4.2.24-1+cuda10.0

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-gpu==1.15.* setuptools==41.*

TensorBoard

TensorBoard is a suite of visualization tools for debugging, optimizing, and understanding TensorFlow, PyTorch, and other machine learning programs.

Use TensorBoard

Use TensorBoard on Databricks Runtime 7.2 and above

Starting TensorBoard in Azure Databricks is no different from starting it in a Jupyter notebook on your local computer.

  1. Load the %tensorboard magic command and define your log directory.

    %load_ext tensorboard
    experiment_log_dir = <log-directory>
    
  2. Invoke the %tensorboard magic command.

    %tensorboard --logdir $experiment_log_dir
    

    The TensorBoard server starts and displays the user interface inline in the notebook. It also provides a link to open TensorBoard in a new tab.

    The following screenshot shows the TensorBoard UI started in a populated log directory.

    [Screenshot: TensorBoard UI]

You can also start TensorBoard by using TensorBoard's notebook module directly.

from tensorboard import notebook
notebook.start("--logdir {}".format(experiment_log_dir))

Use TensorBoard on Databricks Runtime 7.1 and below

To start TensorBoard from your notebook, use the dbutils.tensorboard utility.

dbutils.tensorboard.start("/tmp/tensorflow_log_dir")

This command displays a link that, when clicked, opens TensorBoard in a new tab.

When started using this API, TensorBoard continues to run until you either stop it with dbutils.tensorboard.stop() or shut down your cluster.

Note

If you attach TensorFlow to your cluster as an Azure Databricks library, you may need to reattach your notebook before starting TensorBoard.

TensorBoard logs and directories

TensorBoard visualizes your machine learning programs by reading logs generated by TensorBoard callbacks and functions in TensorBoard or PyTorch. To generate logs for other machine learning libraries, you can write logs directly using TensorFlow file writers (see Module: tf.summary for TensorFlow 2.x and Module: tf.compat.v1.summary for the older API in TensorFlow 1.x).

To make sure that your experiment logs are reliably stored, Azure Databricks recommends writing logs to DBFS (that is, a log directory under /dbfs/) rather than to the ephemeral cluster file system. For each experiment, start TensorBoard in a unique directory. For each run of your machine learning code in the experiment that generates logs, set the TensorBoard callback or file writer to write to a subdirectory of the experiment directory. That way, the data in the TensorBoard UI is separated into runs.
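
One way to follow this layout is to create a fresh subdirectory for every run. The sketch below is illustrative only: `new_run_dir` is a hypothetical helper, the timestamp-plus-random-suffix naming scheme is an assumption, and a temporary directory stands in for an experiment directory under /dbfs/ so the example can run anywhere.

```python
import os
import tempfile
import time
import uuid


def new_run_dir(experiment_log_dir: str) -> str:
    """Create and return a unique run subdirectory inside the experiment directory."""
    run_name = "{}-{}".format(time.strftime("%Y%m%d-%H%M%S"), uuid.uuid4().hex[:8])
    run_dir = os.path.join(experiment_log_dir, run_name)
    os.makedirs(run_dir, exist_ok=False)  # fail loudly rather than reuse a run directory
    return run_dir


# Stand-in for an experiment directory under /dbfs/ (the real path is up to you).
experiment_log_dir = tempfile.mkdtemp(prefix="tensorboard-experiment-")

first_run = new_run_dir(experiment_log_dir)
second_run = new_run_dir(experiment_log_dir)
print(first_run)
print(second_run)
```

Each returned path can then be given to the TensorBoard callback or file writer as its log directory, while TensorBoard itself is started on `experiment_log_dir`.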

Read the official TensorBoard documentation to get started using TensorBoard to log information for your machine learning program.

Manage TensorBoard processes

TensorBoard processes started within an Azure Databricks notebook are not terminated when the notebook is detached or the REPL is restarted (for example, when you clear the state of the notebook). To manually kill a TensorBoard process, send it a termination signal using %sh kill -15 pid. Improperly killed TensorBoard processes might corrupt notebook.list().

To list the TensorBoard servers currently running on your cluster, with their corresponding log directories and process IDs, run notebook.list() from the TensorBoard notebook module.
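
The same termination signal can be sent from Python, since signal 15 is SIGTERM. This is a minimal sketch for POSIX systems: `terminate_process` is a hypothetical helper, and a sleeping child process stands in for a TensorBoard server whose process ID you would normally read from the notebook.list() output.

```python
import os
import signal
import subprocess
import sys


def terminate_process(pid: int) -> None:
    """Send SIGTERM (signal 15) to a process, like `kill -15 pid`."""
    os.kill(pid, signal.SIGTERM)


# Stand-in for a TensorBoard server: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

terminate_process(child.pid)
child.wait(timeout=10)  # reap the child so it does not linger as a zombie
print("exit status:", child.returncode)
```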

Known issues

  • The inline TensorBoard UI is inside an iframe. Browser security features prevent external links within the UI from working unless you open the link in a new tab.
  • The --window_title option of TensorBoard is overridden on Azure Databricks.
  • By default, TensorBoard scans a port range to select a port to listen on. If there are too many TensorBoard processes running on the cluster, all ports in the port range may be unavailable. You can work around this limitation by specifying a port number with the --port argument. The specified port should be between 6006 and 6106.
  • For download links to work, open TensorBoard in a tab.
  • When using TensorBoard 1.15.0, the Projector tab is blank. As a workaround, to visit the projector page directly, replace #projector in the URL with data/plugin/projector/projector_binary.html.
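
For the port limitation, a free port can be probed before starting TensorBoard. The following is a sketch under stated assumptions: `find_free_port` is a hypothetical helper, the 6006 to 6106 range is treated as inclusive, and another process could still claim the port between the probe and the moment TensorBoard binds it.

```python
import socket
from typing import Optional


def find_free_port(start: int = 6006, end: int = 6106) -> Optional[int]:
    """Return the first port in [start, end] that can be bound, or None if none can."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    return None


port = find_free_port()
if port is None:
    print("No free port between 6006 and 6106")
else:
    # Arguments to pass to TensorBoard; replace <log-directory> with your own path.
    print("--logdir <log-directory> --port {}".format(port))
```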

Use TensorFlow on a single node

To test and migrate single-machine TensorFlow workflows, you can start with a driver-only cluster on Azure Databricks by setting the number of workers to zero. Though Apache Spark is not functional under this setting, it is a cost-effective way to run single-machine TensorFlow workflows. The following notebook shows how you can run TensorFlow (1.x and 2.x) with TensorBoard monitoring on a driver-only cluster.

TensorFlow 1.15/2.x notebook

Get notebook