PyTorch

Article
03/06/2024

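The PyTorch project is a Python package that provides GPU-accelerated tensor computation and high-level functionality for building deep learning networks. For licensing details, see the PyTorch license file on GitHub.

To monitor and debug your PyTorch models, consider using TensorBoard.

PyTorch is included in Databricks Runtime for Machine Learning. If you are using the standard Databricks Runtime instead, see Install PyTorch for installation instructions.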

Note

This is not a comprehensive guide to PyTorch. For more information, see the PyTorch website.

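Single node and distributed training

To test and migrate single-machine workflows, use a Single Node cluster.

For distributed training options for deep learning, see Distributed training.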

Example notebook

PyTorch notebook

Get notebook

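Install PyTorch

Databricks Runtime for ML

Databricks Runtime for Machine Learning includes PyTorch, so you can create a cluster and start using PyTorch right away. For the version of PyTorch installed in the Databricks Runtime ML version you are using, see the release notes.

Databricks Runtime

Databricks recommends that you use the PyTorch included in Databricks Runtime for Machine Learning. However, if you must use the standard Databricks Runtime, you can install PyTorch as a Databricks PyPI library. The following example shows how to install PyTorch 1.5.0: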

On GPU clusters, install pytorch and torchvision by specifying the following:
- torch==1.5.0
- torchvision==0.6.0
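
After the libraries are attached, a notebook cell along the following lines can confirm that the versions actually installed match the pins listed above. This is a minimal sketch that uses only the Python standard library; the helper name `check_pins` and the `PINS` dictionary are illustrative and are not part of any Databricks or PyTorch API.

```python
from importlib import metadata

# Pins copied from the list above.
PINS = {"torch": "1.5.0", "torchvision": "0.6.0"}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore a local version suffix such as "+cpu" when comparing.
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, base == wanted)
    return report


for name, (installed, ok) in check_pins(PINS).items():
    print(f"{name}: installed={installed} matches_pin={ok}")
```

A package that is not installed is reported as `None` rather than raising, so the same cell can be run before and after attaching the libraries.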

On CPU clusters, install pytorch and torchvision by using the following Python wheel files:

https://download.pytorch.org/whl/cpu/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl

https://download.pytorch.org/whl/cpu/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
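
The file names of these wheels encode the interpreter they were built for: `cp37` means CPython 3.7. The sketch below decodes those fields and compares them with the running interpreter, which can help explain why a wheel refuses to install. The helper names are illustrative, and the parsing assumes a wheel file name without an optional build tag, as is the case for the two URLs above.

```python
import sys
from urllib.parse import unquote, urlsplit

WHEEL_URLS = [
    "https://download.pytorch.org/whl/cpu/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl",
    "https://download.pytorch.org/whl/cpu/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl",
]


def wheel_tags(url):
    """Split a wheel URL into the fields encoded in its file name."""
    # "%2B" in the URL is a percent-encoded "+".
    filename = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    stem = filename[: -len(".whl")]
    # Assumes exactly five dash-separated fields (no build tag).
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def running_python_tag():
    """Python tag of the current interpreter, assuming it is CPython."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


for url in WHEEL_URLS:
    tags = wheel_tags(url)
    compatible = tags["python_tag"] == running_python_tag()
    print(tags["name"], tags["version"], tags["python_tag"], "matches interpreter:", compatible)
```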

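Errors and troubleshooting for distributed PyTorch

The following sections describe common error messages and troubleshooting guidance for the classes PyTorch DataParallel and PyTorch DistributedDataParallel. Most of these errors can likely be resolved with TorchDistributor, which is available on Databricks Runtime ML 13.0 and above. However, if TorchDistributor is not a viable solution, each section also provides a recommended alternative.

The following is an example of how to use TorchDistributor: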


from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # Training logic goes here.
    ...

num_processes = 2
distributor = TorchDistributor(num_processes=num_processes, local_mode=True)

distributor.run(train_fn, 1e-3)

`process 0 terminated with exit code 1`

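This error occurs when you use notebooks, regardless of the environment: Databricks, a local machine, and so on. To avoid this error, use torch.multiprocessing.start_processes with start_method="fork" instead of torch.multiprocessing.spawn.

For example: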

import torch

def train_fn(rank, learning_rate):
    # required setup, e.g. setup(rank)
    # ...
    ...

num_processes = 2
torch.multiprocessing.start_processes(train_fn, args=(1e-3,), nprocs=num_processes, start_method="fork")
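
Note that start_processes supplies the process index as the first positional argument and then appends the contents of args, which is why train_fn accepts rank even though args contains only the learning rate. The following CPU-only sketch makes that visible by having each worker record what it received. The out_dir parameter, the file-based reporting, and the run_demo helper are illustrative assumptions rather than part of the example above, and the "fork" start method is not available on Windows.

```python
import os
import tempfile

import torch.multiprocessing as mp


def train_fn(rank, learning_rate, out_dir):
    # `rank` is supplied by start_processes; the other parameters come from `args`.
    # A real training function would set up the process group and train here.
    with open(os.path.join(out_dir, f"rank{rank}.txt"), "w") as fh:
        fh.write(str(learning_rate))


def run_demo(num_processes=2, learning_rate=1e-3):
    """Launch forked workers and return {rank: learning rate seen by that rank}."""
    with tempfile.TemporaryDirectory() as out_dir:
        # Blocks until every worker has exited (join=True is the default).
        mp.start_processes(
            train_fn,
            args=(learning_rate, out_dir),
            nprocs=num_processes,
            start_method="fork",
        )
        seen = {}
        for rank in range(num_processes):
            with open(os.path.join(out_dir, f"rank{rank}.txt")) as fh:
                seen[rank] = float(fh.read())
        return seen


results = run_demo()
print(results)
```

Each rank from 0 to num_processes - 1 appears once in the result, paired with the same learning rate.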

`The server socket has failed to bind to [::]:{PORT NUMBER} (errno: 98 - Address already in use).`

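This error occurs when you restart distributed training after interrupting the cell while training was in progress.

To resolve the issue, restart the cluster. If restarting does not solve the problem, there might be an error in the training function code.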
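
Before restarting, it can be useful to confirm that the port named in the message is in fact still held. The following is a minimal diagnostic sketch using only the standard library: it tries to bind the port and reports whether the operating system refuses with "address already in use" (errno 98 on Linux). The helper name port_is_bound is illustrative, and probing over IPv4 only is a simplifying assumption.

```python
import errno
import socket


def port_is_bound(port, host=""):
    """Return True if binding (host, port) fails because the address is in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # An empty host means "all IPv4 interfaces".
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise
    return False


# Demonstration: hold a port in this process, then probe it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
    holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    holder.listen(1)
    held_port = holder.getsockname()[1]
    print(held_port, "in use:", port_is_bound(held_port))
```

In practice, pass the port number from the error message instead of held_port. The probe releases the port immediately, so it does not interfere with a later training run.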

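You can also run into issues with CUDA, because start_method="fork" is not CUDA-compatible. Using any .cuda commands in any cell might lead to failures. To avoid these errors, add the following check before you call torch.multiprocessing.start_processes: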

if torch.cuda.is_initialized():
    raise Exception("CUDA was initialized; distributed training will fail.") # or something similar
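
One way to make sure the check is never skipped is to fold it into a small launcher and always start training through that launcher. The following is a sketch under that assumption; launch_forked is a hypothetical helper, not a PyTorch or Databricks API.

```python
import torch
import torch.multiprocessing as mp


def launch_forked(fn, args=(), nprocs=1):
    """Run fn(rank, *args) in forked workers, refusing if CUDA is already initialized."""
    # The check runs in the parent (notebook) process, before any worker is forked.
    if torch.cuda.is_initialized():
        raise RuntimeError(
            "CUDA was initialized in this process; "
            "distributed training with start_method='fork' will fail."
        )
    mp.start_processes(fn, args=args, nprocs=nprocs, start_method="fork")
```

For example, `launch_forked(train_fn, args=(1e-3,), nprocs=2)` would replace a direct call to start_processes. Raising RuntimeError rather than a bare Exception is a stylistic choice that makes the failure easier to catch selectively.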

Train a PyTorch model