Azure Databricks concepts

This article introduces the fundamental concepts you need to understand to use the Azure Databricks workspace effectively.

Workspace

The workspace is an environment for accessing all of your Azure Databricks assets. It organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.

This section describes the objects contained in the Azure Databricks workspace folders.

Notebook

A web-based interface to documents that contain runnable commands, visualizations, and narrative text.

Dashboard

An interface that provides organized access to visualizations.

Library

A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can add your own.

Experiment

A collection of MLflow runs for training a machine learning model.

Interface

This section describes the interfaces that Azure Databricks supports for accessing your assets: the UI, the REST API, and the command-line interface (CLI).

UI

The Azure Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained objects, data objects, and computational resources.

REST API

There are two versions of the REST API: REST API 2.0 and REST API 1.2. REST API 2.0 supports most of the functionality of REST API 1.2, adds further functionality, and is the preferred version.
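
As a minimal sketch of what a REST API 2.0 call looks like, the snippet below builds (but does not send) a request. It assumes bearer-token authentication and the `/api/2.0/clusters/list` endpoint, neither of which is described in this article; the workspace URL and the token are hypothetical placeholders.

```python
import urllib.request


def build_list_clusters_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build, without sending, a GET request against REST API 2.0."""
    # Assumed endpoint: the 2.0 API is addressed under the /api/2.0/ prefix.
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/list"
    request = urllib.request.Request(url, method="GET")
    # Assumed authentication scheme: a personal access token sent as a bearer token.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical workspace URL and token, for illustration only.
request = build_list_clusters_request(
    "https://example.azuredatabricks.net/", "dapi-example-token"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a workspace.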

CLI

An open source project hosted on GitHub. The CLI is built on top of REST API 2.0.

Data management

This section describes the objects that hold the data on which you perform analytics and that you feed into machine learning algorithms.

Databricks File System (DBFS)

A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images) and other directories. DBFS is automatically populated with some datasets that you can use to learn Azure Databricks.
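
The following sketch illustrates the directory-style addressing that this abstraction layer provides. It assumes, beyond what this article states, that files addressed as `dbfs:/...` are also visible under a local `/dbfs/...` mount on cluster nodes; the helper name and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def dbfs_uri_to_local_path(uri: str) -> str:
    """Map a 'dbfs:/...' URI to the assumed '/dbfs/...' local mount path."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DBFS URI: {uri!r}")
    # Drop the scheme and any extra leading slashes, then re-root under /dbfs.
    relative = uri[len(prefix):].lstrip("/")
    return str(PurePosixPath("/dbfs") / relative)


# Hypothetical file path, for illustration only.
print(dbfs_uri_to_local_path("dbfs:/tmp/example/data.csv"))  # /dbfs/tmp/example/data.csv
```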

Database

A collection of information that is organized so that it can be easily accessed, managed, and updated.

Table

A representation of structured data. You query tables with Apache Spark SQL and the Apache Spark APIs.
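
The two query styles can be sketched side by side. The functions below take a Spark session as a parameter, so they only run where one is available; the function names are hypothetical, and `diamonds` in the usage note is a placeholder table name.

```python
def first_rows_with_sql(spark, table_name: str, limit: int):
    """Query a table with Spark SQL and return the resulting DataFrame."""
    return spark.sql(f"SELECT * FROM {table_name} LIMIT {limit}")


def first_rows_with_dataframe_api(spark, table_name: str, limit: int):
    """Query the same table through the DataFrame API."""
    return spark.table(table_name).limit(limit)
```

Given a Spark session named `spark`, calling `first_rows_with_sql(spark, "diamonds", 5).show()` would display the first rows. Both functions describe the same query, so the choice between them is mostly a matter of style.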

Metastore

The component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. Every Azure Databricks deployment has a central Hive metastore that all clusters can access to persist table metadata. You also have the option to use an existing external Hive metastore.

Computation management

This section describes concepts that you need to know to run computations in Azure Databricks.

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job.

  • You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
  • The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
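
The lifecycle difference between the two cluster types can be sketched as a toy model. This is plain Python for illustration, not the Azure Databricks API; the class, function, and cluster names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    kind: str  # "all-purpose" or "job"
    running: bool = True

    def terminate(self) -> None:
        self.running = False

    def restart(self) -> None:
        # Only all-purpose clusters can be restarted after termination.
        if self.kind == "job":
            raise RuntimeError("a job cluster cannot be restarted")
        self.running = True


def run_job_on_new_cluster(job_name: str, task):
    """Mimic the job scheduler: create a job cluster, run the task, terminate."""
    cluster = Cluster(name=f"job-{job_name}", kind="job")
    try:
        result = task()
    finally:
        # The cluster is terminated when the job completes, even on failure.
        cluster.terminate()
    return result, cluster


shared = Cluster(name="shared-analysis", kind="all-purpose")
shared.terminate()
shared.restart()  # allowed for an all-purpose cluster

result, job_cluster = run_job_on_new_cluster("nightly", lambda: 42)
print(shared.running, job_cluster.running)  # True False
```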

Pool

A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster's request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
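
The allocation behavior described above can be sketched as a small in-memory model. This is a toy illustration rather than the pool implementation; the class and attribute names are hypothetical.

```python
class Pool:
    """Toy model of an instance pool that expands on demand."""

    def __init__(self, idle_instances: int):
        self.idle = idle_instances
        self.provisioned = 0  # instances newly requested from the instance provider

    def acquire(self, count: int) -> None:
        """Hand `count` instances to a cluster, expanding the pool if needed."""
        shortfall = max(0, count - self.idle)
        self.provisioned += shortfall
        self.idle += shortfall
        self.idle -= count

    def release(self, count: int) -> None:
        """Return a terminated cluster's instances to the pool for reuse."""
        self.idle += count


pool = Pool(idle_instances=2)
pool.acquire(3)  # one driver and two workers: one instance must be newly allocated
pool.release(3)  # the attached cluster is terminated
print(pool.idle, pool.provisioned)  # 3 1
```

After the cluster terminates, all three instances sit idle in the pool, so a later cluster requesting up to three nodes would start without waiting on the provider.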

Databricks runtime

The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers several types of runtimes:

  • Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
  • Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
  • Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic and biomedical data.
  • Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don't need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.

Job

A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.
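
As a rough sketch, a scheduled notebook job can be described as a JSON document. The field names below are assumed to follow the REST API 2.0 `jobs/create` request and are not taken from this article; the runtime version, node type, notebook path, and schedule are placeholders.

```python
import json


def build_scheduled_notebook_job(name: str, notebook_path: str, cron: str,
                                 timezone: str = "UTC") -> dict:
    """Describe a notebook job that runs on a new job cluster on a schedule."""
    return {
        "name": name,
        # Placeholder cluster settings; substitute values valid for your workspace.
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": notebook_path},
        # Quartz cron syntax: seconds, minutes, hours, day of month, month, day of week.
        "schedule": {"quartz_cron_expression": cron, "timezone_id": timezone},
    }


# Hypothetical job that runs a notebook every day at 02:00.
payload = build_scheduled_notebook_job(
    "nightly-report", "/Users/someone@example.com/report", "0 0 2 * * ?"
)
print(json.dumps(payload, indent=2))
```

Omitting the `schedule` entry would describe a job that runs only when triggered.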

Workload

Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).

  • Data engineering: An (automated) workload runs on a job cluster, which the Azure Databricks job scheduler creates for each workload.
  • Data analytics: An (interactive) workload runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
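
The rule above reduces to a lookup on the cluster type, as in this minimal sketch. The function name and the string labels are hypothetical.

```python
def classify_workload(cluster_kind: str) -> str:
    """Return the workload type implied by the kind of cluster used."""
    if cluster_kind == "job":
        return "data engineering"
    if cluster_kind == "all-purpose":
        # Applies to notebook commands and to jobs run on an existing
        # all-purpose cluster alike.
        return "data analytics"
    raise ValueError(f"unknown cluster kind: {cluster_kind!r}")


print(classify_workload("job"))          # data engineering
print(classify_workload("all-purpose"))  # data analytics
```

Note that what is being run (a notebook command or a job) does not enter the decision; only the cluster type does.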

Execution context

The state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.

Model management

This section describes concepts that you need to know to train machine learning models.

Model

A mathematical function that represents the relationship between a set of predictors and an outcome. Machine learning consists of training and inference steps. You train a model using an existing dataset, and then use that model to predict the outcomes (inference) of new data.
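
To make the two steps concrete, here is a minimal sketch that trains a one-predictor linear model by ordinary least squares and then runs inference on a new value. The technique and the tiny dataset are chosen for illustration only; the article does not prescribe a particular model.

```python
def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Inference: apply the trained function to a new predictor value."""
    slope, intercept = model
    return slope * x + intercept


# Training step: an existing dataset that follows y = 2x + 1.
model = train([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
# Inference step: predict the outcome for a new data point.
print(predict(model, 4.0))  # 9.0
```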

Run

A collection of parameters, metrics, and tags related to training a machine learning model.

Experiment

The primary unit of organization and access control for runs; all MLflow runs belong to an experiment. An experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools.
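
The relationship between runs and experiments can be sketched with plain data classes. This is a toy model and not the MLflow API; the class names, method names, and the logged values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    params: dict
    metrics: dict
    tags: dict = field(default_factory=dict)


@dataclass
class Experiment:
    name: str
    runs: list = field(default_factory=list)

    def log_run(self, run: Run) -> Run:
        # Every run belongs to exactly one experiment.
        self.runs.append(run)
        return run

    def best_run(self, metric: str, higher_is_better: bool = True) -> Run:
        """Compare runs on one metric and return the best one."""
        candidates = [run for run in self.runs if metric in run.metrics]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda run: run.metrics[metric])


experiment = Experiment("churn-model")
experiment.log_run(Run(params={"depth": 3}, metrics={"accuracy": 0.81}))
experiment.log_run(Run(params={"depth": 5}, metrics={"accuracy": 0.86}))
print(experiment.best_run("accuracy").params)  # {'depth': 5}
```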

Authentication and authorization

This section describes concepts that you need to know when you manage Azure Databricks users and their access to Azure Databricks assets.

User

A unique individual who has access to the system.

Group

A collection of users.

Access control list (ACL)

A list of permissions attached to a workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each entry in a typical ACL specifies a subject and an operation.
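
An ACL made of subject and operation pairs can be sketched as follows. The subjects and operation names are hypothetical and do not correspond to actual Azure Databricks permission levels.

```python
from typing import FrozenSet, Tuple

# Each entry pairs a subject (a user or system process) with an allowed operation.
AclEntry = Tuple[str, str]


def is_allowed(acl: FrozenSet[AclEntry], subject: str, operation: str) -> bool:
    """Return True if the ACL grants `operation` to `subject`."""
    return (subject, operation) in acl


# Hypothetical ACL attached to a single notebook.
notebook_acl = frozenset({
    ("alice@example.com", "read"),
    ("alice@example.com", "run"),
    ("bob@example.com", "read"),
})

print(is_allowed(notebook_acl, "alice@example.com", "run"))  # True
print(is_allowed(notebook_acl, "bob@example.com", "run"))    # False
```

Anything not listed is denied, which is why the check is a simple membership test.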