DBFS API

The DBFS API is a Databricks API that makes it simple to interact with various data sources without having to include your credentials every time you read a file. See Databricks File System (DBFS) for more information. For an easy-to-use command-line client of the DBFS API, see Databricks CLI.

Note

To ensure high quality of service under heavy load, Azure Databricks now enforces API rate limits for DBFS API calls. Limits are set per workspace to ensure fair usage and high availability. Automatic retries are available in Databricks CLI version 0.12.0 and above. We advise all customers to switch to the latest Databricks CLI version.

Important

To access Databricks REST APIs, you must authenticate.

Add block

Endpoint | HTTP Method
2.0/dbfs/add-block | POST

Append a block of data to the stream specified by the input handle. If the handle does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST. If the block of data exceeds 1 MB, this call throws an exception with MAX_BLOCK_SIZE_EXCEEDED. Example request:

{
  "data": "ZGF0YWJyaWNrcwo=",
  "handle": 7904256
}

Request structure

Field Name | Type | Description
handle | INT64 | The handle on an open stream. This field is required.
data | BYTES | The base64-encoded data to append to the stream. This has a limit of 1 MB. This field is required.
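As a minimal sketch of how a client might prepare add-block request bodies, the helper below splits a byte string into blocks and base64-encodes each one. The names `iter_add_block_payloads` and `MAX_BLOCK_BYTES` are hypothetical, and the sketch assumes that 1 MB means 1024 * 1024 bytes measured before encoding, which the text above does not specify.

```python
import base64

# Assumption: the 1 MB limit is 1024 * 1024 bytes of unencoded data.
MAX_BLOCK_BYTES = 1024 * 1024


def iter_add_block_payloads(handle, data, block_size=MAX_BLOCK_BYTES):
    """Yield one add-block request body per block of `data`.

    `handle` is the INT64 handle returned by a create call.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        yield {
            "handle": handle,
            # The data field carries base64 text, as in the example request.
            "data": base64.b64encode(chunk).decode("ascii"),
        }
```

Calling `list(iter_add_block_payloads(7904256, b"databricks\n"))` yields a single body matching the example request above.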

Close

Endpoint | HTTP Method
2.0/dbfs/close | POST

Close the stream specified by the input handle. If the handle does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST.

Request structure

Field Name | Type | Description
handle | INT64 | The handle on an open stream. This field is required.

Create

Endpoint | HTTP Method
2.0/dbfs/create | POST

Open a stream to write to a file and return a handle to this stream. There is a 10-minute idle timeout on this handle. If a file or directory already exists on the given path and overwrite is set to false, this call throws an exception with RESOURCE_ALREADY_EXISTS. A typical workflow for file upload is:

  1. Issue a create call and get a handle.
  2. Issue one or more add-block calls with the handle you have.
  3. Issue a close call with the handle you have.
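The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not a reference client: `upload_bytes` is a hypothetical name, and `post` is a caller-supplied callable (also hypothetical) that sends an authenticated POST to the given endpoint and returns the decoded JSON response as a dict.

```python
import base64


def upload_bytes(post, path, data, overwrite=False, block_size=1024 * 1024):
    """Upload `data` to `path` using create, add-block, and close.

    `post(endpoint, payload) -> dict` is a hypothetical transport
    supplied by the caller; this sketch performs no HTTP itself.
    """
    # Step 1: open the stream and obtain a handle.
    reply = post("2.0/dbfs/create", {"path": path, "overwrite": overwrite})
    handle = reply["handle"]
    try:
        # Step 2: append the data in blocks no larger than block_size.
        for start in range(0, len(data), block_size):
            chunk = data[start:start + block_size]
            post("2.0/dbfs/add-block", {
                "handle": handle,
                "data": base64.b64encode(chunk).decode("ascii"),
            })
    finally:
        # Step 3: close the stream, even if an add-block call failed.
        post("2.0/dbfs/close", {"handle": handle})
    return handle
```

Closing in a `finally` block is a design choice of this sketch, so that a failed upload does not leave the handle open until the idle timeout expires.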

Request structure

Field Name | Type | Description
path | STRING | The path of the new file. The path should be the absolute DBFS path (e.g. /mnt/foo.txt). This field is required.
overwrite | BOOL | The flag that specifies whether to overwrite an existing file or files.

Response structure

Field Name | Type | Description
handle | INT64 | The handle to pass to the subsequent add-block and close calls when writing to a file through a stream.

Delete

Endpoint | HTTP Method
2.0/dbfs/delete | POST

Delete the file or directory (optionally recursively delete all files in the directory). This call throws an exception with IO_ERROR if the path is a non-empty directory and recursive is set to false, or on other similar errors.

When you delete a large number of files, the delete operation is done in increments. The call returns a response after approximately 45 seconds with an error message (503 Service Unavailable) asking you to re-invoke the delete operation until the directory structure is fully deleted. For example:

{
  "error_code":"PARTIAL_DELETE","message":"The requested operation has deleted 324 files. There are more files remaining. You must make another request to delete more."
}
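A client can handle this incremental behaviour by repeating the call until it stops reporting PARTIAL_DELETE. The sketch below is illustrative only: `delete_until_done` is a hypothetical name, `post` is a hypothetical caller-supplied callable returning the HTTP status and the decoded JSON body, and a completed delete is assumed to return HTTP 200.

```python
def delete_until_done(post, path, recursive=True, max_attempts=100):
    """Re-invoke delete until the service stops reporting PARTIAL_DELETE.

    `post(endpoint, payload) -> (http_status, body_dict)` is a
    hypothetical transport. Returns the number of calls made.
    """
    payload = {"path": path, "recursive": recursive}
    for attempt in range(1, max_attempts + 1):
        status, body = post("2.0/dbfs/delete", payload)
        if body.get("error_code") == "PARTIAL_DELETE":
            # More files remain; issue the same request again.
            continue
        if status != 200:  # assumption: success is reported as HTTP 200
            raise RuntimeError(f"delete failed: {status} {body}")
        return attempt
    raise RuntimeError(f"delete still partial after {max_attempts} attempts")
```

The `max_attempts` cap is a safeguard added by this sketch so that a delete that never completes does not loop forever.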

For operations that delete more than 10k files, we discourage using the DBFS REST API and advise you to perform such operations in the context of a cluster, using File system utilities. dbutils.fs covers the functional scope of the DBFS REST API, but from notebooks. Running such operations from notebooks provides better control and manageability, such as selective deletes, and the possibility to automate periodic delete jobs.

Request structure

Field Name | Type | Description
path | STRING | The path of the file or directory to delete. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
recursive | BOOL | Whether or not to recursively delete the directory's contents. Deleting empty directories can be done without providing the recursive flag.

Get status

Endpoint | HTTP Method
2.0/dbfs/get-status | GET

Get the file information of a file or directory. If the file or directory does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST.

Request structure

Field Name | Type | Description
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.

Response structure

Field Name | Type | Description
path | STRING | The path of the file or directory.
is_dir | BOOL | True if the path is a directory.
file_size | INT64 | The length of the file in bytes or zero if the path is a directory.
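To illustrate how a caller might interpret a get-status response, the sketch below classifies a decoded response body. `path_kind` is a hypothetical helper, and it assumes that errors arrive as a JSON body with an `error_code` field, as in the PARTIAL_DELETE example shown for delete.

```python
def path_kind(body):
    """Classify a decoded get-status response body.

    Returns "missing", "directory", or "file". Any error other than
    RESOURCE_DOES_NOT_EXIST is raised as a RuntimeError.
    """
    error_code = body.get("error_code")
    if error_code == "RESOURCE_DOES_NOT_EXIST":
        return "missing"
    if error_code is not None:
        raise RuntimeError(f"get-status failed: {error_code}")
    return "directory" if body["is_dir"] else "file"
```

For example, `path_kind({"path": "/a.cpp", "is_dir": False, "file_size": 261})` returns `"file"`.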

List

Endpoint | HTTP Method
2.0/dbfs/list | GET

List the contents of a directory, or details of the file. If the file or directory does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST.

When calling list on a large directory, the list operation times out after approximately 60 seconds. We strongly recommend using list only on directories containing fewer than 10k files and discourage using the DBFS REST API for operations that list more than 10k files. Instead, we recommend that you perform such operations in the context of a cluster, using File system utilities, which provide the same functionality without timing out.

Example reply:

{
  "files": [
    {
      "path": "/a.cpp",
      "is_dir": false,
      "file_size": 261
    },
    {
      "path": "/databricks-results",
      "is_dir": true,
      "file_size": 0
    }
  ]
}
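A reply of this shape can be processed with ordinary JSON handling. The sketch below separates directories from files and totals the file sizes. `summarize_listing` is a hypothetical helper, and treating a reply without a `files` field as an empty listing is an assumption of this sketch.

```python
def summarize_listing(reply):
    """Summarize a decoded list reply.

    Returns the directory paths, the file paths, and the combined
    size of the files in bytes.
    """
    # Assumption: a missing "files" field means an empty listing.
    entries = reply.get("files", [])
    return {
        "directories": [e["path"] for e in entries if e["is_dir"]],
        "files": [e["path"] for e in entries if not e["is_dir"]],
        "total_bytes": sum(e["file_size"] for e in entries if not e["is_dir"]),
    }
```

Applied to the example reply above, the result lists `/databricks-results` as the only directory, `/a.cpp` as the only file, and a total of 261 bytes.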

Request structure

Field Name | Type | Description
path | STRING | The path of the file or directory. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.

Response structure

Field Name | Type | Description
files | An array of FileInfo | A list of FileInfo that describes the contents of the directory or file.

Mkdirs

Endpoint | HTTP Method
2.0/dbfs/mkdirs | POST

Create the given directory and necessary parent directories if they do not exist. If a file (not a directory) exists at any prefix of the input path, this call throws an exception with RESOURCE_ALREADY_EXISTS. If this operation fails, it may still have created some of the necessary parent directories.

Request structure

Field Name | Type | Description
path | STRING | The path of the new directory. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.

Move

Endpoint | HTTP Method
2.0/dbfs/move | POST

Move a file from one location to another location within DBFS. If the source file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST. If a file already exists at the destination path, this call throws an exception with RESOURCE_ALREADY_EXISTS. If the given source path is a directory, this call always recursively moves all files.

When moving a large number of files, the API call times out after approximately 60 seconds, potentially resulting in partially moved data. Therefore, for operations that move more than 10k files, we strongly discourage using the DBFS REST API. Instead, we recommend that you perform such operations in the context of a cluster, using File system utilities from a notebook, which provide the same functionality without timing out.

Request structure

Field Name | Type | Description
source_path | STRING | The source path of the file or directory. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
destination_path | STRING | The destination path of the file or directory. The path should be the absolute DBFS path (e.g. /mnt/bar/). This field is required.

Put

Endpoint | HTTP Method
2.0/dbfs/put | POST

Upload a file through the use of multipart form post. It is mainly used for streaming uploads, but can also be used as a convenient single call for data upload. Example usage:

In the following examples, replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.

curl -F contents=@localsrc -F path="PATH" https://<databricks-instance>/api/2.0/dbfs/put

localsrc is the path to a local file to upload. This usage is supported only with multipart form post (that is, using -F or --form with curl).

Alternatively, you can pass contents as a base64 string. Examples:

curl -F contents="BASE64" -F path="PATH" https://<databricks-instance>/api/2.0/dbfs/put
curl -H "Content-Type: application/json" -d '{"path":"PATH","contents":"BASE64"}' https://<databricks-instance>/api/2.0/dbfs/put

The amount of data that can be passed using the contents parameter (that is, not streaming) is limited to 1 MB; MAX_BLOCK_SIZE_EXCEEDED is thrown if the limit is exceeded. Use streaming upload if you want to upload large files. See Create, Add block, and Close for details.
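As an illustration of the base64 variant, the sketch below builds the JSON body for a put call and rejects data over the limit before anything is sent. `build_put_body` and `MAX_CONTENTS_BYTES` are hypothetical names, and the sketch assumes that the 1 MB limit is 1024 * 1024 bytes of unencoded data.

```python
import base64
import json

# Assumption: the 1 MB limit is 1024 * 1024 bytes of unencoded data.
MAX_CONTENTS_BYTES = 1024 * 1024


def build_put_body(path, data, overwrite=False):
    """Return the JSON request body for a non-streaming put call."""
    if len(data) > MAX_CONTENTS_BYTES:
        # Larger files need create, add-block, and close instead.
        raise ValueError("data exceeds 1 MB; use streaming upload")
    return json.dumps({
        "path": path,
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": overwrite,
    })
```

The returned string could be sent with the `Content-Type: application/json` header, in the same way as the second curl example above.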

Request structure

Field Name | Type | Description
path | STRING | The path of the new file. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
contents | BYTES | This parameter might be absent, and instead a posted file will be used.
overwrite | BOOL | The flag that specifies whether to overwrite existing files.

Read

Endpoint | HTTP Method
2.0/dbfs/read | GET

Return the contents of a file. If the file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST. If the path is a directory, the read length is negative, or the offset is negative, this call throws an exception with INVALID_PARAMETER_VALUE. If the read length exceeds 1 MB, this call throws an exception with MAX_READ_SIZE_EXCEEDED. If offset + length exceeds the number of bytes in the file, the call reads contents until the end of the file.

Request structure

Field Name | Type | Description
path | STRING | The path of the file to read. The path should be the absolute DBFS path (e.g. /mnt/foo/). This field is required.
offset | INT64 | The offset to read from in bytes.
length | INT64 | The number of bytes to read starting from the offset. This has a limit of 1 MB, and a default value of 0.5 MB.

Response structure

Field Name | Type | Description
bytes_read | INT64 | The number of bytes read (could be less than length if the end of the file is reached). This refers to the number of bytes read in the unencoded version (response data is base64-encoded).
data | BYTES | The base64-encoded contents of the file read.
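Because each call returns at most 1 MB, reading a larger file means looping over offsets. The sketch below shows one way to do that. `read_file` is a hypothetical name, `get` is a hypothetical caller-supplied callable that issues the GET request and returns the decoded JSON response, and the sketch assumes that a read starting exactly at the end of the file returns `bytes_read` of 0 rather than an error.

```python
import base64


def read_file(get, path, chunk_size=512 * 1024):
    """Read a whole DBFS file by issuing repeated read calls.

    `get(endpoint, params) -> dict` is a hypothetical transport.
    """
    if not 0 < chunk_size <= 1024 * 1024:
        raise ValueError("chunk_size must be between 1 byte and 1 MB")
    parts = []
    offset = 0
    while True:
        reply = get("2.0/dbfs/read",
                    {"path": path, "offset": offset, "length": chunk_size})
        bytes_read = reply["bytes_read"]
        if bytes_read == 0:
            break  # assumption: nothing left to read at this offset
        parts.append(base64.b64decode(reply["data"]))
        # bytes_read counts unencoded bytes, so it advances the offset directly.
        offset += bytes_read
        if bytes_read < chunk_size:
            break  # a short read means the end of the file was reached
    return b"".join(parts)
```

Stopping on a short read avoids one extra request in the common case where the file size is not an exact multiple of `chunk_size`.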

Data structures


FileInfo

Store the attributes of a file or directory.

Field Name | Type | Description
path | STRING | The path of the file or directory.
is_dir | BOOL | True if the path is a directory.
file_size | INT64 | The length of the file in bytes or zero if the path is a directory.