SparkJob Class

Reference

A standalone Spark job.

Inheritance:

azure.ai.ml.entities._job.job.Job -> SparkJob

azure.ai.ml.entities._job.parameterized_spark.ParameterizedSpark -> SparkJob

azure.ai.ml.entities._job.job_io_mixin.JobIOMixin -> SparkJob

azure.ai.ml.entities._job.spark_job_entry_mixin.SparkJobEntryMixin -> SparkJob

Constructor

SparkJob(*, driver_cores: int | None = None, driver_memory: str | None = None, executor_cores: int | None = None, executor_memory: str | None = None, executor_instances: int | None = None, dynamic_allocation_enabled: bool | None = None, dynamic_allocation_min_executors: int | None = None, dynamic_allocation_max_executors: int | None = None, inputs: Dict | None = None, outputs: Dict | None = None, compute: str | None = None, identity: Dict[str, str] | ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, resources: Dict | SparkResourceConfiguration | None = None, **kwargs)

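Parameters

driver_cores: Optional[int]

The number of cores to use for the driver process, only in cluster mode.

driver_memory: Optional[str]

The amount of memory to use for the driver process, formatted as a string with a size unit suffix ("k", "m", "g" or "t"), for example "512m" or "2g".

executor_cores: Optional[int]

The number of cores to use on each executor.

executor_memory: Optional[str]

The amount of memory to use per executor process, formatted as a string with a size unit suffix ("k", "m", "g" or "t"), for example "512m" or "2g".

executor_instances: Optional[int]

The initial number of executors.

dynamic_allocation_enabled: Optional[bool]

Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.

dynamic_allocation_min_executors: Optional[int]

The lower bound for the number of executors if dynamic allocation is enabled.

dynamic_allocation_max_executors: Optional[int]

The upper bound for the number of executors if dynamic allocation is enabled.

inputs: Optional[dict[str, Input]]

The mapping of input data bindings used in the job.

outputs: Optional[dict[str, Output]]

The mapping of output data bindings used in the job.

compute: Optional[str]

The compute resource the job runs on.

identity: Optional[Union[dict[str, str], ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]

The identity that the Spark job will use while running on compute.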
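The memory and dynamic-allocation parameters follow simple conventions that can be checked before a job is built. The sketch below is illustrative only and is not part of the SDK: parse_memory and check_dynamic_allocation are hypothetical helper names, the suffixes are assumed to be binary multiples (1k = 1024 bytes), and the rule that the lower bound must not exceed the upper bound is an assumption of this sketch.

```python
import re
from typing import Optional

# Size suffixes named in the parameter descriptions; binary multiples are an
# assumption of this sketch.
_UNIT_BYTES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}
_MEMORY_RE = re.compile(r"^(\d+)([kmgt])$")


def parse_memory(value: str) -> int:
    """Convert a string such as "512m" or "2g" into a byte count."""
    match = _MEMORY_RE.match(value.strip().lower())
    if match is None:
        raise ValueError(f"expected <number><k|m|g|t>, got {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_BYTES[unit]


def check_dynamic_allocation(
    enabled: Optional[bool],
    min_executors: Optional[int],
    max_executors: Optional[int],
) -> None:
    """Reject executor bounds that cannot describe a valid range."""
    if not enabled:
        return  # the bounds only matter when dynamic allocation is on
    if min_executors is not None and max_executors is not None:
        if min_executors > max_executors:
            raise ValueError("min_executors must not exceed max_executors")


print(parse_memory("512m"))  # 536870912
print(parse_memory("2g"))    # 2147483648
check_dynamic_allocation(True, 1, 4)  # passes silently
```

A check like this can run on user-supplied values before they are passed as driver_memory, executor_memory, or the dynamic_allocation_* arguments.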

Examples

Configuring a SparkJob.


   from azure.ai.ml import Input, Output
   from azure.ai.ml.entities import SparkJob

   spark_job = SparkJob(
       code="./sdk/ml/azure-ai-ml/tests/test_configs/dsl_pipeline/spark_job_in_pipeline/basic_src",
       entry={"file": "sampleword.py"},
       conf={
           "spark.driver.cores": 2,
           "spark.driver.memory": "1g",
           "spark.executor.cores": 1,
           "spark.executor.memory": "1g",
           "spark.executor.instances": 1,
       },
       environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
       inputs={
           "input1": Input(
               type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
           )
       },
       compute="synapsecompute",
       outputs={"component_out_path": Output(type="uri_folder")},
       args="--input1 ${{inputs.input1}} --output2 ${{outputs.component_out_path}}",
   )
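The args string refers to inputs and outputs through ${{inputs.<name>}} and ${{outputs.<name>}} placeholders, so each placeholder is expected to name a key declared in inputs or outputs. The following sketch lists placeholders that have no matching declaration. It is a local illustration only: find_unbound_placeholders is a hypothetical helper, not an SDK function, and it does not reproduce any validation the service may perform.

```python
import re
from typing import Iterable, List

# Matches ${{inputs.name}} and ${{outputs.name}} as they appear in `args`.
_PLACEHOLDER_RE = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def find_unbound_placeholders(
    args: str, inputs: Iterable[str], outputs: Iterable[str]
) -> List[str]:
    """Return the placeholders in `args` that name no declared input or output."""
    declared = {"inputs": set(inputs), "outputs": set(outputs)}
    return [
        f"{kind}.{name}"
        for kind, name in _PLACEHOLDER_RE.findall(args)
        if name not in declared[kind]
    ]


# Hypothetical argument string; sample_rate is deliberately left undeclared.
args = (
    "--input1 ${{inputs.input1}} "
    "--out ${{outputs.component_out_path}} "
    "--rate ${{inputs.sample_rate}}"
)
missing = find_unbound_placeholders(
    args, inputs=["input1"], outputs=["component_out_path"]
)
print(missing)  # ['inputs.sample_rate']
```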

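Methods

dump	Dumps the job content into a file in YAML format.
filter_conf_fields	Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

dump

Dumps the job content into a file in YAML format.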

dump(dest: str | PathLike | IO, **kwargs) -> None

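Parameters

dest: Union[PathLike, str, IO[AnyStr]]

Required

The local path or file stream to write the YAML content to. If dest is a file path, a new file will be created. If dest is an open file, the file will be written to directly.

kwargs: dict

Additional arguments to pass to the YAML serializer.

Exceptions

FileExistsError

Raised if dest is a file path and the file already exists.

IOError

Raised if dest is an open file and the file is not writable.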
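The two dest cases described for dump (a path that must not exist yet, or an already-open stream) can be pictured with a small standard-library sketch. This is not the SDK implementation: write_yaml_text is a hypothetical helper that only mirrors the documented behavior, and it takes YAML text directly instead of serializing a job.

```python
import io
import os
from typing import IO, Union


def write_yaml_text(text: str, dest: Union[str, os.PathLike, IO[str]]) -> None:
    """Write `text` to a new file path or to an already-open text stream."""
    if isinstance(dest, (str, os.PathLike)):
        # Mode "x" creates the file and raises FileExistsError if it exists.
        with open(dest, "x", encoding="utf-8") as handle:
            handle.write(text)
    else:
        # An open stream is written to directly; write failures surface as
        # OSError (IOError is an alias of OSError in Python 3).
        dest.write(text)


buffer = io.StringIO()
write_yaml_text("type: spark\n", buffer)
print(buffer.getvalue())
```

With the real class, the corresponding call takes the same kind of argument, for example spark_job.dump("spark_job.yml") for a path that does not exist yet.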

filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

filter_conf_fields() -> Dict[str, str]

Returns

A dictionary of the conf fields that are not Spark configuration fields.

Return type

dict[str, str]
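The filtering described for filter_conf_fields amounts to keeping only the conf entries whose keys are not in a known set. The sketch below shows that idea on a plain dictionary. KNOWN_SPARK_KEYS is a hypothetical stand-in built from the keys used in the example on this page; the actual contents of CONF_KEY_MAP are not reproduced here, and filter_unknown_conf is not an SDK function.

```python
from typing import Dict

# Hypothetical stand-in for the Spark configuration keys that CONF_KEY_MAP
# covers; the real mapping may contain more or different entries.
KNOWN_SPARK_KEYS = frozenset(
    {
        "spark.driver.cores",
        "spark.driver.memory",
        "spark.executor.cores",
        "spark.executor.memory",
        "spark.executor.instances",
    }
)


def filter_unknown_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return the conf entries whose keys are not known Spark fields."""
    return {key: value for key, value in conf.items() if key not in KNOWN_SPARK_KEYS}


conf = {
    "spark.driver.memory": "1g",
    "spark.executor.instances": "1",
    "spark.hypothetical.setting": "on",  # made-up key for illustration
}
print(filter_unknown_conf(conf))  # {'spark.hypothetical.setting': 'on'}
```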


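Attributes

base_path

The base path of the resource.

Returns

The base path of the resource.

Return type

str

creation_context

The creation context of the resource.

Returns

The creation metadata for the resource.

Return type

Optional[SystemData]

entry

environment

The Azure ML environment to run the Spark component or job in.

Returns

The Azure ML environment to run the Spark component or job in.

Return type

Optional[Union[str, Environment]]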

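id

The resource ID.

Returns

The global ID of the resource, an Azure Resource Manager (ARM) ID.

Return type

Optional[str]

identity

The identity that the Spark job will use while running on compute.

Returns

The identity that the Spark job will use while running on compute.

Return type

Optional[Union[ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]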

inputs

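log_files

Job output files.

Returns

The dictionary of log names and URLs.

Return type

Optional[Dict[str, str]]

outputs

resources

The compute resource configuration for the job.

Returns

The compute resource configuration for the job.

Return type

Optional[SparkResourceConfiguration]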

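status

The status of the job.

Common values returned include "Running", "Completed", and "Failed". All possible values are:

NotStarted - A temporary state that client-side Run objects are in before cloud submission.
Starting - The run has started being processed in the cloud. The caller has a run ID at this point.
Provisioning - On-demand compute is being created for a given job submission.
Preparing - The run environment is being prepared and is in one of two stages:
- Docker image build
- conda environment setup
Queued - The job is queued on the compute target. For example, in BatchAI, the job is in a queued state while waiting for all the requested nodes to be ready.
Running - The job has started to run on the compute target.
Finalizing - User code execution has completed, and the run is in post-processing stages.
CancelRequested - Cancellation has been requested for the job.
Completed - The run has completed successfully. This includes both the user code execution and the run post-processing stages.
Failed - The run failed. Usually the Error property on the run provides details as to why.
Canceled - Follows a cancellation request and indicates that the run has now been successfully canceled.
NotResponding - For runs that have heartbeats enabled, no heartbeat has been sent recently.

Returns

The status of the job.

Return type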

Optional[str]
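Code that waits on a job often needs to know whether a status value can still change. The sketch below treats "Completed", "Failed", and "Canceled" as final; that grouping is an assumption read from the status descriptions, and is_terminal is a hypothetical helper, not part of the SDK.

```python
from typing import Optional

# Statuses assumed to be final, based on the descriptions of each value.
TERMINAL_STATUSES = frozenset({"Completed", "Failed", "Canceled"})


def is_terminal(status: Optional[str]) -> bool:
    """Return True when the status is one of the assumed final states."""
    return status in TERMINAL_STATUSES


for status in ("Queued", "Running", "Completed", None):
    print(status, is_terminal(status))
```

Because the property is Optional[str], the helper accepts None and reports it as not final.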

studio_url

The Azure ML studio endpoint.

Returns

The URL to the job details page.

Return type

Optional[str]

type

The type of the job.

Returns

The type of the job.

Return type

Optional[str]

CODE_ID_RE_PATTERN

CODE_ID_RE_PATTERN = re.compile('\\/subscriptions\\/(?P<subscription>[\\w,-]+)\\/resourceGroups\\/(?P<resource_group>[\\w,-]+)\\/providers\\/Microsoft\\.MachineLearningServices\\/workspaces\\/(?P<workspace>[\\w,-]+)\\/codes\\/(?P<co)
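The pattern is cut off in the text above, so the sketch below uses only the visible portion, up to the /codes/ segment, and does not attempt to reconstruct the rest. CODE_ID_PREFIX_RE is a hypothetical name for that partial pattern, and the sample ID is made up with placeholder values.

```python
import re

# Only the part of the pattern that is visible in this page; the truncated
# tail is intentionally left out.
CODE_ID_PREFIX_RE = re.compile(
    r"\/subscriptions\/(?P<subscription>[\w,-]+)"
    r"\/resourceGroups\/(?P<resource_group>[\w,-]+)"
    r"\/providers\/Microsoft\.MachineLearningServices"
    r"\/workspaces\/(?P<workspace>[\w,-]+)\/codes\/"
)

# Hypothetical ARM-style ID with placeholder values.
sample_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-rg/providers/Microsoft.MachineLearningServices"
    "/workspaces/my-ws/codes/example"
)

match = CODE_ID_PREFIX_RE.match(sample_id)
if match is not None:
    print(match.group("subscription"))
    print(match.group("resource_group"))
    print(match.group("workspace"))
```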
