Load files from Azure Blob storage and Azure Data Lake Storage Gen1 and Gen2 using Auto Loader

Auto Loader incrementally and efficiently processes new data files as they arrive in Azure Blob storage and Azure Data Lake Storage Gen1 and Gen2.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.

Auto Loader works with DBFS paths as well as direct paths to the data source.

Requirements

Databricks Runtime 7.2 or above.

If you created streams using Databricks Runtime 7.1 or below, see Changes in default option values and compatibility and Cloud resource management.

File discovery modes

Auto Loader supports two modes for detecting new files: directory listing and file notification.

  • Directory listing: Identifies new files by listing the input directory in parallel. Directory listing mode lets you start Auto Loader streams quickly without any permission configuration and is suitable for scenarios where only a few files need to be streamed in on a regular basis. Directory listing mode is the default for Auto Loader in Databricks Runtime 7.2 and above.

    In Databricks Runtime 7.3 LTS and above, Auto Loader supports Azure Data Lake Storage Gen1 only in directory listing mode.

  • File notification: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure permissions for the Azure Event Grid and Queue Storage services and specify .option("cloudFiles.useNotifications", "true"), as in the sketch below.
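
For illustration, a minimal sketch of opting into file notification mode; the format, schema, and paths here are hypothetical:

val events = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")           // hypothetical source format
  .option("cloudFiles.useNotifications", "true") // file notification instead of directory listing
  .schema("date DATE, eventId STRING")           // hypothetical schema
  .load("abfss://container@account.dfs.core.windows.net/input") // hypothetical input path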

You can change mode when you restart the stream. For example, you may want to switch to file notification mode when directory listing becomes too slow because the input directory has grown. In both modes, Auto Loader internally keeps track of which files have been processed to provide exactly-once semantics, so you do not need to manage any state information yourself.

Use the cloudFiles source

To use Auto Loader, create a cloudFiles source in the same way as other streaming sources:

Python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load(<input-path>)

df.writeStream.format("delta") \
  .option("checkpointLocation", <checkpoint-path>) \
  .start(<output-path>)

Scala

val df = spark.readStream.format("cloudFiles")
  .option(<cloudFiles-option>, <option-value>)
  .schema(<schema>)
  .load(<input-path>)

df.writeStream.format("delta")
  .option("checkpointLocation", <checkpoint-path>)
  .start(<output-path>)

where:

  • <cloudFiles-option> is a configuration option in Cloud resource management.

  • <schema> is the file schema.

    Note

    On Databricks Runtime 7.3 LTS and above, if the file format is text or binaryFile, you don’t need to provide the schema.

  • <input-path> is the path in Azure Blob storage or Azure Data Lake Storage Gen1 or Gen2 that is monitored for new files. Subdirectories of <input-path> are also monitored. <input-path> can contain file glob patterns.

  • <checkpoint-path> is the output stream checkpoint location.

  • <output-path> is the output stream path.
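
For example, a minimal Scala sketch with the placeholders filled in; all paths and the schema below are hypothetical:

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")            // <cloudFiles-option>: the source file format
  .schema("id LONG, ts TIMESTAMP, value DOUBLE") // <schema>: hypothetical file schema
  .load("/mnt/raw/events")                       // <input-path>: subdirectories are monitored too

df.writeStream.format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/events") // <checkpoint-path>
  .start("/mnt/delta/events")                              // <output-path>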

Schema inference and evolution

Note

Available in Databricks Runtime 8.2 and above.

Auto Loader supports schema inference and evolution with JSON, binary (binaryFile), and text file formats. See Schema inference and evolution in Auto Loader for details.
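
A minimal sketch of how this is wired up (paths hypothetical): omit .schema(...) and point cloudFiles.schemaLocation at a durable location where Auto Loader can store the inferred schema and track subsequent changes:

val inferred = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/mnt/schemas/events") // where the inferred schema is stored
  .load("/mnt/raw/events") // no .schema(...) call: the schema is inferred and evolved from here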

Configuration

Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they are in a separate namespace from other Structured Streaming source options.

Important

Some default option values changed in Databricks Runtime 7.2. If you are using Auto Loader on Databricks Runtime 7.1 or below, see Changes in default option values and compatibility.

  • cloudFiles.allowOverwrites (Boolean, default: false): Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above.
  • cloudFiles.fetchParallelism (Integer, default: 1): Number of threads to use when fetching messages from the queueing service.
  • cloudFiles.format (String, required, no default): The data file format in the source path.
  • cloudFiles.includeExistingFiles (Boolean, default: true): Whether to include existing files in the input path in the streaming processing, versus processing only new files that arrive after the notifications are set up. This option is respected only when you start a stream for the first time; changing its value at stream restart has no effect.
  • cloudFiles.inferColumnTypes (Boolean, default: false): Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON datasets. See schema inference for details.
  • cloudFiles.maxBytesPerTrigger (Byte string, default: none): Maximum number of new bytes to be processed in every trigger. You can specify a byte string such as 10g to limit each microbatch to 10 GB of data. This is a soft maximum: if you have files that are 3 GB each, Azure Databricks processes 12 GB in a microbatch. When used together with cloudFiles.maxFilesPerTrigger, Azure Databricks consumes up to the lower limit of cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger, whichever is reached first. This option has no effect when used with Trigger.Once().
  • cloudFiles.maxFilesPerTrigger (Integer, default: 1000): Maximum number of new files to be processed in every trigger.
  • cloudFiles.partitionColumns (String, default: none): Partition columns that you want to explicitly parse from Hive-style directory structures, provided as a comma-separated list of columns.
  • cloudFiles.schemaEvolutionMode (String, default: "addNewColumns" when a schema is not provided, "none" otherwise): The mode for evolving the schema as new columns are discovered in the data. See schema evolution for details.
  • cloudFiles.schemaHints (String, default: none): Schema information that you provide to Auto Loader during schema inference. See schema hints for details.
  • cloudFiles.schemaLocation (String, required when inferring the schema, no default): A location to store the inferred schema and subsequent changes. See schema inference for details.
  • cloudFiles.useNotifications (Boolean, default: false): Whether to use file notification mode to determine when there are new files. If false, use directory listing mode. See File discovery modes.
  • cloudFiles.validateOptions (Boolean, default: true): Whether to validate Auto Loader options and return an error for unknown or inconsistent options.
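
As an example of the rate-limiting options above, a sketch (values and paths illustrative) that caps each microbatch by both file count and bytes; whichever limit is reached first wins:

val limited = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.maxFilesPerTrigger", "500") // at most 500 new files per microbatch
  .option("cloudFiles.maxBytesPerTrigger", "10g") // soft cap of roughly 10 GB per microbatch
  .schema("id LONG, payload STRING")              // hypothetical schema
  .load("/mnt/raw/events")                        // hypothetical input path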

Provide the following options only if you choose cloudFiles.useNotifications = true:

  • cloudFiles.connectionString (String, default: none): The connection string for the storage account, based on either an account access key or a shared access signature (SAS). (1)
  • cloudFiles.resourceGroup (String, default: none): The Azure Resource Group under which the storage account is created.
  • cloudFiles.subscriptionId (String, default: none): The Azure Subscription ID under which the resource group is created.
  • cloudFiles.tenantId (String, default: none): The Azure Tenant ID under which the service principal is created.
  • cloudFiles.clientId (String, default: none): The client ID or application ID of the service principal. (1)
  • cloudFiles.clientSecret (String, default: none): The client secret of the service principal.
  • cloudFiles.queueName (String, default: none): The URL of the Azure queue. If provided, the cloud files source consumes events directly from this queue instead of setting up its own Azure Event Grid and Queue Storage services. In that case, your cloudFiles.connectionString requires only read permissions on the queue.

(1) See Permissions.
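
Putting the notification options together, a sketch of a file notification stream authenticated with a service principal; every angle-bracketed value is a placeholder you must supply:

val notified = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json") // hypothetical format
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.resourceGroup", <resource-group>)
  .option("cloudFiles.subscriptionId", <subscription-id>)
  .option("cloudFiles.tenantId", <tenant-id>)
  .option("cloudFiles.clientId", <service-principal-client-id>)
  .option("cloudFiles.clientSecret", <service-principal-client-secret>)
  .schema(<schema>)
  .load(<input-path>)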

Note

Notifications are held in the Azure message queue for 7 days. If you stop the stream and restart it after more than 7 days, the notifications in the message queue are lost. While the notifications are stopped, Azure Databricks falls back to directory listing mode and processes files from the point where the stream stopped; there is no data loss. However, this might take some time, and performance will be slow until Azure Databricks catches up to the current state of the stream.

Changes in default option values and compatibility

The default values of the following Auto Loader options changed in Databricks Runtime 7.2 to the values listed in Cloud resource management.

  • cloudFiles.useNotifications
  • cloudFiles.includeExistingFiles
  • cloudFiles.validateOptions

Auto Loader streams started on Databricks Runtime 7.1 and below have the following default option values:

  • cloudFiles.useNotifications is true
  • cloudFiles.includeExistingFiles is false
  • cloudFiles.validateOptions is false

To ensure compatibility with existing applications, these default option values do not change when you run your existing Auto Loader streams on Databricks Runtime 7.2 or above; the streams will have the same behavior after the upgrade.
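
If you instead want a new stream on Databricks Runtime 7.2 or above to mirror the old behavior, one approach (a sketch, not a requirement) is to pin the pre-7.2 defaults explicitly:

val legacy = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")                // hypothetical format
  .option("cloudFiles.useNotifications", "true")      // pre-7.2 default
  .option("cloudFiles.includeExistingFiles", "false") // pre-7.2 default
  .option("cloudFiles.validateOptions", "false")      // pre-7.2 default
  .schema(<schema>)
  .load(<input-path>)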

Metrics

Auto Loader reports metrics at every batch. You can view how many files exist in the backlog and how large the backlog is in the numFilesOutstanding and numBytesOutstanding metrics under the Raw Data tab in the streaming query progress dashboard:

{
  "sources" : [
    {
      "description" : "CloudFilesSource[/path/to/source]",
      "metrics" : {
        "numFilesOutstanding" : "238",
        "numBytesOutstanding" : "163939124006"
      }
    }
  ]
}
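
You can also read these metrics programmatically from the query's most recent progress report; a minimal sketch, assuming query is the handle returned by writeStream.start():

val progress = query.lastProgress // null until the first batch completes
if (progress != null) {
  println(progress.json) // includes the per-source "metrics" block shown above
}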

Permissions

You must have read permissions for the input directory. See Azure Blob storage and Azure Data Lake Storage Gen2.

To use file notification mode, you must provide authentication credentials for setting up and accessing the event notification services. In Databricks Runtime 8.1 and above, you need only a service principal for authentication. For Databricks Runtime 8.0 and below, you must provide both a service principal and a connection string.

  • Service principal

    Create an Azure Active Directory app and service principal in the form of a client ID and client secret. You must assign this app the following roles on the storage account in which the input path resides:

    • Contributor: This role is for setting up resources in your storage account, such as queues and event subscriptions.
    • EventGrid EventSubscription Contributor: This role is for performing Event Grid subscription operations such as creating or listing event subscriptions.
    • Storage Queue Data Contributor: This role is for performing queue operations such as retrieving and deleting messages from the queues. In Databricks Runtime 8.1 and above, this role is required only when you provide a service principal without a connection string.
  • Connection string

    Auto Loader requires a connection string to authenticate Azure Queue Storage operations, such as creating a queue and reading and deleting messages from it. The queue is created in the same storage account where the input directory path is located. You can find your connection string in your account key or shared access signature (SAS).

    • If you are using Databricks Runtime 8.1 or above, you do not need a connection string.

    • If you are using Databricks Runtime 8.0 or below, you must provide a connection string to authenticate Azure Queue Storage operations, such as creating a queue and retrieving and deleting messages from it. The queue is created in the same storage account in which the input path resides. You can find your connection string in your account key or shared access signature (SAS). When configuring an SAS token, you must provide the following permissions:

      (Image: SAS token permissions required by Auto Loader)

      Note

      If you do not have the necessary permissions to create resources, you can ask an administrator to perform the setup using the Cloud resource management Scala API. An administrator can provide you with the queue name, which you can specify directly as .option("cloudFiles.queueName", <queue-name>) to the cloudFiles source.

Troubleshooting

Error:

java.lang.RuntimeException: Failed to create event grid subscription.

If you see this error message when you run Auto Loader for the first time, Event Grid is not registered as a resource provider in your Azure subscription. To register it in the Azure portal:

  1. Go to your subscription.
  2. Under the Settings section, click Resource providers.
  3. Register the provider Microsoft.EventGrid.

Error:

403 Forbidden ... does not have authorization to perform action 'Microsoft.EventGrid/eventSubscriptions/[read|write]' over scope ...

If you see this error message when you run Auto Loader for the first time, ensure that you have given the Contributor role to your service principal for Event Grid as well as for your storage account.

Cloud resource management

You can use a Scala API to manage the Azure Event Grid and Queue Storage services created by Auto Loader. You must configure the resource setup permissions described in Permissions before using this API.

import com.databricks.sql.CloudFilesAzureResourceManager
val manager = CloudFilesAzureResourceManager
  .newManager
  .option("cloudFiles.connectionString", <connection-string>)
  .option("cloudFiles.resourceGroup", <resource-group>)
  .option("cloudFiles.subscriptionId", <subscription-id>)
  .option("cloudFiles.tenantId", <tenant-id>)
  .option("cloudFiles.clientId", <service-principal-client-id>)
  .option("cloudFiles.clientSecret", <service-principal-client-secret>)
  .option("path", <path-to-specific-container-and-folder>) // required only for setUpNotificationServices
  .create()

// Set up an AQS queue and an event grid subscription associated with the path used in the manager. Available in Databricks Runtime 7.4 and above.
manager.setUpNotificationServices(<resource-suffix>)

// List notification services created by Auto Loader
manager.listNotificationServices()

// Tear down the notification services created for a specific stream ID.
// Stream ID is a GUID string that you can find in the list result above.
manager.tearDownNotificationServices(<stream-id>)

Note

Available in Databricks Runtime 7.4 and above.

Use setUpNotificationServices(<resource-suffix>) to create a queue and an Event Grid subscription with the name <resource-prefix><resource-suffix>. If a queue or Event Grid subscription with the same name already exists, Azure Databricks reuses the existing resource instead of creating a new one. This function returns a queue that you can pass to the cloudFiles source using .option("cloudFiles.queueName", <queue-name>). This enables the cloudFiles source user to have fewer permissions than the user who creates the resources. See Permissions.

Provide the "path" option to newManager only if calling setUpNotificationServices; it is not needed for listNotificationServices or tearDownNotificationServices. This is the same path that you use when running a streaming query.
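
For example, a sketch that wires the queue created by the manager into a cloudFiles stream; as in the block above, the angle-bracketed values are placeholders:

// Create (or reuse) the queue and Event Grid subscription for this path.
val queueName = manager.setUpNotificationServices(<resource-suffix>)

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")       // hypothetical format
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.queueName", queueName) // consume from the pre-created queue
  .option("cloudFiles.connectionString", <connection-string>) // read permissions on the queue suffice
  .schema(<schema>)
  .load(<path-to-specific-container-and-folder>)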

Frequently asked questions (FAQ)

Do I need to create Azure event notification services beforehand?

No. If you choose file notification mode, Auto Loader automatically sets up the file event notification pipeline (storage events > Event Grid subscription > queue) for Azure Blob storage and Azure Data Lake Storage Gen1 and Gen2 when you start the stream.

How do I clean up the event notification resources, such as Event Grid subscriptions and queues, created by Auto Loader?

You can use the cloud resource manager to list and tear down resources. You can also delete these resources manually, either in the web portal or using Azure APIs. All resources created by Auto Loader have the prefix <resource-prefix>.

Does Auto Loader process the file again when the file gets appended or overwritten?

Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. If a file is appended to or overwritten, Azure Databricks does not guarantee which version of the file is processed. For well-defined behavior, Databricks suggests that you use Auto Loader to ingest only immutable files. If this does not meet your requirements, contact your Databricks representative.

Can I run multiple streaming queries from the same input directory?

Yes. Each cloud files stream, as identified by a unique checkpoint directory, has its own queue, and the same Azure Blob storage and Azure Data Lake Storage Gen1 and Gen2 events can be sent to multiple queues.
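
For instance, a sketch (all paths hypothetical) of two independent queries over the same input directory; the distinct checkpoint locations make them separate streams with separate queues:

val source = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .schema("id LONG, body STRING") // hypothetical schema
  .load("/mnt/raw/events")        // the shared input directory

source.writeStream.format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/bronze") // stream A
  .start("/mnt/delta/bronze")

source.writeStream.format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/audit")  // stream B, independent of A
  .start("/mnt/delta/audit")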

If my data files do not arrive continuously, but at regular intervals, for example, once a day, should I still use this source, and are there any benefits?

Yes and yes. In this case, you can set up a Trigger-Once Structured Streaming job and schedule it to run after the anticipated file arrival time. The first run sets up the event notification services, which remain on even when the streaming cluster is down. When you restart the stream, the cloudFiles source fetches and processes all file events backed up in the queue. The benefit of using Auto Loader in this case is that you don’t need to determine which files are new and need processing each time, which can be very expensive.
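
A minimal sketch of such a scheduled job (paths hypothetical): Trigger.Once() drains whatever has accumulated in the queue and then stops, so the job can run on a schedule:

import org.apache.spark.sql.streaming.Trigger

val daily = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true") // events accumulate in the queue between runs
  .schema("id LONG, body STRING")                // hypothetical schema
  .load("/mnt/raw/daily-drop")                   // hypothetical input path

daily.writeStream.format("delta")
  .trigger(Trigger.Once())                       // process the backlog, then stop
  .option("checkpointLocation", "/mnt/checkpoints/daily")
  .start("/mnt/delta/daily")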

What happens if I change the checkpoint location when restarting the stream?

A checkpoint location maintains important identifying information for a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new one. The new stream creates new progress information and, if you are using file notification mode, new Azure Event Grid and Queue Storage services. You must manually clean up the checkpoint location and the Azure Event Grid and Queue Storage services for any abandoned streams.