What is Azure Databricks Workspace?

Azure Databricks Workspace is an analytics platform based on Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace in which data engineers, data scientists, and machine learning engineers can collaborate.

What is Azure Databricks?

In a big data pipeline, raw or structured data is ingested into Azure in batches through Azure Data Factory, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources, such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse, and use Spark to turn it into insights.

[Figure: Databricks pipeline]

Apache Spark analytics platform

Azure Databricks Workspace comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks Workspace includes the following components:

[Figure: Apache Spark in Azure Databricks]

  • Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.

  • Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.

  • MLlib: Machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.

  • GraphX: Graphs and graph computation for a broad range of use cases, from cognitive analytics to data exploration.

  • Spark Core API: Includes support for R, SQL, Python, Scala, and Java.

Apache Spark in Azure Databricks Workspace

Azure Databricks Workspace builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favorite Spark applications

Fully managed Apache Spark clusters in the cloud

Azure Databricks provides a secure and reliable production environment in the cloud, managed and supported by Spark experts. You can:

  • Create clusters in seconds.
  • Dynamically autoscale clusters up and down and share them across teams.
  • Use clusters programmatically by invoking REST APIs.
  • Use secure data integration capabilities built on top of Spark that enable you to unify your data without centralization.
  • Get instant access to the latest Apache Spark features with each release.
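
The REST API and autoscaling items in the list above can be illustrated with a minimal sketch. It assumes the Databricks Clusters API 2.0 `clusters/create` endpoint with bearer-token authentication; the function name `build_create_cluster_request` is hypothetical, and the workspace URL, token, runtime version string, and VM size are placeholders that must be replaced with values valid for your workspace. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_create_cluster_request(workspace_url, token, name, min_workers, max_workers):
    """Build (but do not send) a request that creates an autoscaling cluster.

    Assumes the Clusters API 2.0 `clusters/create` endpoint.
    """
    payload = {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
        # The cluster scales between these bounds depending on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


create_request = build_create_cluster_request(
    "https://adb-0000000000000000.0.azuredatabricks.net",  # placeholder URL
    "<personal-access-token>",                              # placeholder token
    "shared-autoscaling",
    min_workers=2,
    max_workers=8,
)
print(create_request.get_method(), create_request.full_url)
# To submit the request for real: urllib.request.urlopen(create_request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any cluster is actually created.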

Databricks Runtime

Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.

Azure Databricks abstracts away infrastructure complexity, so you do not need specialized expertise to set up and configure your data infrastructure.

For data engineers who care about the performance of production jobs, Azure Databricks provides a faster, more efficient Spark engine through various optimizations at the I/O layer and processing layer (Databricks I/O).

Workspace for collaboration

Through a collaborative and integrated environment, Azure Databricks streamlines the process of exploring data, prototyping, and running data-driven applications in Spark.

  • Determine how to use data with easy data exploration.
  • Document your progress in notebooks in R, Python, Scala, or SQL.
  • Visualize data in a few clicks, and use familiar tools like Matplotlib, ggplot, or d3.
  • Use interactive dashboards to create dynamic reports.
  • Use Spark and interact with the data simultaneously.

Enterprise security

Azure Databricks Workspace provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.

  • Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure Databricks.
  • Azure Databricks role-based access enables fine-grained user permissions for notebooks, clusters, jobs, and data.
  • Enterprise-grade SLAs.
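
As a rough illustration of the fine-grained cluster permissions mentioned in the list above, the following sketch builds a request that grants one user a permission level on one cluster. It assumes the Databricks Permissions API 2.0 (`PATCH /api/2.0/permissions/clusters/{cluster_id}`) and the cluster permission levels `CAN_ATTACH_TO`, `CAN_RESTART`, and `CAN_MANAGE`; the function name, workspace URL, token, cluster ID, and user name are all hypothetical placeholders. The request is built but not sent.

```python
import json
import urllib.request

# Assumed permission levels for clusters in the Permissions API 2.0.
CLUSTER_PERMISSION_LEVELS = {"CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE"}


def build_cluster_permission_request(workspace_url, token, cluster_id, user_name, level):
    """Build (but do not send) a request granting `user_name` a cluster permission."""
    if level not in CLUSTER_PERMISSION_LEVELS:
        raise ValueError(f"unknown cluster permission level: {level}")
    payload = {
        "access_control_list": [
            {"user_name": user_name, "permission_level": level},
        ]
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/permissions/clusters/{cluster_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",  # PATCH adds to existing permissions instead of replacing them
    )


permission_request = build_cluster_permission_request(
    "https://adb-0000000000000000.0.azuredatabricks.net",  # placeholder URL
    "<personal-access-token>",                              # placeholder token
    "0000-000000-example",                                  # placeholder cluster ID
    "analyst@example.com",                                  # placeholder user
    "CAN_ATTACH_TO",
)
print(permission_request.get_method(), permission_request.full_url)
```

Validating the permission level before building the request surfaces typos locally rather than as an error response from the service.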


Azure Databricks Workspace is a Microsoft Azure first-party service that is deployed on the Global Azure Public Cloud infrastructure. All communications between components of the service, including between the public IPs in the control plane and the customer data plane, remain within the Microsoft Azure network backbone. See also Microsoft global network.

Integration with Azure services

Azure Databricks Workspace integrates deeply with Azure databases and stores: Synapse Analytics, Cosmos DB, Data Lake Store, and Blob storage.

Integration with Power BI

Through rich integration with Power BI, Azure Databricks Workspace lets you discover and share insights quickly and easily. You can also use other BI tools, such as Tableau Software.

Next steps