Unexpected cluster termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Azure Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation.

Azure Databricks initiated request limit exceeded

To defend against API abuse, ensure quality of service, and prevent you from accidentally creating too many large clusters, Azure Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and resizing. The throttling uses the token bucket algorithm to limit the total number of nodes that anyone can launch over a defined interval across your Databricks deployment, while allowing burst requests of certain sizes. Requests coming from both the web UI and the APIs are subject to rate limiting. When cluster requests exceed the rate limit, the limit-exceeding request fails with a REQUEST_LIMIT_EXCEEDED error.
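
To make the throttling behavior concrete, here is a minimal sketch of a token bucket in Python. It is not Azure Databricks's implementation: the class name, the capacity, and the refill rate are all hypothetical values chosen for illustration, and one token is assumed to stand for one requested node.

```python
class TokenBucket:
    """Illustrative token bucket. Capacity and refill rate are made-up values."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last = now

    def try_acquire(self, nodes, now):
        """Return True if `nodes` tokens are available at time `now` (seconds)."""
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if nodes <= self.tokens:
            self.tokens -= nodes
            return True
        # A rejected request consumes nothing; in the service this is where
        # a REQUEST_LIMIT_EXCEEDED error would be returned.
        return False


bucket = TokenBucket(capacity=100, refill_per_second=1.0)
first = bucket.try_acquire(80, now=0)    # burst fits in the full bucket
second = bucket.try_acquire(40, now=0)   # only 20 tokens left, rejected
third = bucket.try_acquire(40, now=30)   # 30 seconds of refill makes room
```

In this sketch a burst succeeds while tokens remain, and a rejected request succeeds later once enough time has passed for the bucket to refill, which is why retrying after a few minutes helps.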

Solution

If you hit the limit for your legitimate workflow, Databricks recommends that you do the following:

  • Retry your request a few minutes later.
  • Spread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all of your jobs to run at an hourly boundary, try distributing them at different intervals within the hour.
  • Consider using clusters with a larger node type and a smaller number of nodes.
  • Use autoscaling clusters.
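
The first two recommendations can be sketched in Python as follows. This is an illustration under stated assumptions, not a Databricks API: `RequestLimitExceeded`, `retry_with_backoff`, and `stagger_minutes` are hypothetical names, the `request` argument is any callable you supply that raises the exception when it receives REQUEST_LIMIT_EXCEEDED, and the delay values are arbitrary.

```python
import random
import time


class RequestLimitExceeded(Exception):
    """Hypothetical exception standing in for a REQUEST_LIMIT_EXCEEDED response."""


def retry_with_backoff(request, max_attempts=5, base_delay=60.0, max_delay=600.0,
                       sleep=time.sleep, jitter=random.random):
    """Call `request` until it succeeds, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff capped at max_delay, scaled to 50-100% by jitter
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + 0.5 * jitter()))


def stagger_minutes(job_count):
    """Return evenly spaced minute offsets within an hour, one per job."""
    return [i * 60 // job_count for i in range(job_count)]
```

For example, `stagger_minutes(4)` yields offsets of 0, 15, 30, and 45 minutes past the hour, so four hourly jobs no longer request clusters at the same moment.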

If these options don't work for you, contact Azure Databricks Support to request a limit increase for the core instance.

For other Azure Databricks initiated termination reasons, see Termination Code.

Cloud provider initiated terminations

This section lists common cloud provider related termination reasons and remediation steps.

Launch failure

This termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and message from the API are propagated to help you troubleshoot the issue.

OperationNotAllowed

You have reached a quota limit, usually the number of cores, that your subscription can launch. Request a limit increase in the Azure portal. See Azure subscription and service limits, quotas, and constraints.

PublicIPCountLimitReached

You have reached the limit of the public IPs that you can have running. Request a limit increase in the Azure portal.

SkuNotAvailable

The resource SKU you have selected (such as VM size) is not available for the location you have selected. To resolve, see Resolve errors for SKU not available.

ReadOnlyDisabledSubscription

Your subscription was disabled. Follow the steps in Why is my Azure subscription disabled and how do I reactivate it? to reactivate your subscription.

ResourceGroupBeingDeleted

This can occur if someone cancels your Azure Databricks workspace in the Azure portal and you try to create a cluster at the same time. The cluster fails because the resource group is being deleted.

SubscriptionRequestsThrottled

Your subscription is hitting the Azure Resource Manager request limit (see Throttling Resource Manager requests). A typical cause is that another system outside Azure Databricks is making a lot of API calls to Azure. Contact Azure support to identify this system and then reduce the number of API calls.
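
As a quick triage aid, the launch-failure codes above can be collected into a lookup table. The sketch below is only a summary of this article: the dictionary name and the `remediation_hint` function are hypothetical, and it assumes you have already extracted the error code string from the propagated message.

```python
# Remediation hints paraphrased from the launch-failure codes described above.
LAUNCH_FAILURE_HINTS = {
    "OperationNotAllowed":
        "Quota limit reached (usually cores); request a limit increase in the Azure portal.",
    "PublicIPCountLimitReached":
        "Public IP limit reached; request a limit increase in the Azure portal.",
    "SkuNotAvailable":
        "The selected SKU (such as VM size) is not available in the selected location.",
    "ReadOnlyDisabledSubscription":
        "The subscription was disabled; reactivate it.",
    "ResourceGroupBeingDeleted":
        "The resource group is being deleted, so the cluster cannot be created.",
    "SubscriptionRequestsThrottled":
        "Azure Resource Manager request limit hit; reduce API calls from other systems.",
}


def remediation_hint(error_code):
    """Return a short hint for a launch-failure code, or a fallback for unlisted codes."""
    return LAUNCH_FAILURE_HINTS.get(
        error_code, "Unlisted code; check the propagated error message."
    )
```

Codes that do not appear in this article fall through to the generic message rather than raising an error.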

Communication lost

Azure Databricks was able to launch the cluster, but lost the connection to the instance hosting the Spark driver.

This is caused by the driver virtual machine going down or by a networking issue.