
What is Azure Databricks?

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

What is Azure Databricks?

Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service. In a big data pipeline, raw or structured data is ingested into Azure in batches through Azure Data Factory, or streamed in near real time using Kafka, Event Hubs, or IoT Hub. The data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources, such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse, and turn it into breakthrough insights using Spark.

[Figure: Databricks pipeline]

Apache Spark-based analytics platform

Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes the following components:


  • Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.

  • Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.

  • MLlib: Machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.

  • GraphX: Graphs and graph computation for a broad scope of use cases, from cognitive analytics to data exploration.

  • Spark Core API: Includes support for R, SQL, Python, Scala, and Java.

Apache Spark in Azure Databricks

Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favorite Spark-based applications

Fully managed Apache Spark clusters in the cloud

Azure Databricks has a secure and reliable production environment in the cloud, managed and supported by Spark experts. You can:

  • Create clusters in seconds.
  • Dynamically autoscale clusters up and down, including serverless clusters, and share them across teams.
  • Use clusters programmatically by using the REST APIs.
  • Use secure data integration capabilities built on top of Spark that enable you to unify your data without centralization.
  • Get instant access to the latest Apache Spark features with each release.
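
As a rough illustration of the REST API and autoscaling points above, the following Python sketch builds, but does not send, a request to the Databricks Clusters API `clusters/create` endpoint with an `autoscale` range. The workspace URL, token, cluster name, runtime version, and node type are hypothetical placeholders, and the helper `build_create_cluster_request` is an illustrative name rather than part of any SDK. Check the field values against the REST API reference for your workspace before relying on them.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own workspace URL and access token.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"


def build_create_cluster_request(name, min_workers, max_workers):
    """Build (but do not send) a POST request that asks for an autoscaling cluster."""
    # Local sanity check added for this sketch; the service applies its own validation.
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    payload = {
        "cluster_name": name,
        # Assumed example values; valid runtime versions and node types vary by workspace.
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        # The cluster scales between these worker counts depending on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    return urllib.request.Request(
        url=f"{WORKSPACE_URL}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_cluster_request("shared-analytics", min_workers=2, max_workers=8)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
# To actually create the cluster, send the request with urllib.request.urlopen(request).
```

Building the request separately from sending it keeps the sketch runnable without a workspace and makes the payload easy to inspect before anything is provisioned.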

Databricks Runtime

The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.

With the Serverless option, Azure Databricks abstracts away the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. The Serverless option helps data scientists iterate quickly as a team.

For data engineers, who care about the performance of production jobs, Azure Databricks provides a faster, more efficient Spark engine through various optimizations at the I/O layer and processing layer (Databricks I/O).

Workspace for collaboration

Through a collaborative and integrated environment, Azure Databricks streamlines the process of exploring data, prototyping, and running data-driven applications in Spark.

  • Determine how to use data with easy data exploration.
  • Document your progress in notebooks in R, Python, Scala, or SQL.
  • Visualize data in a few clicks, and use familiar tools like Matplotlib, ggplot, or d3.
  • Use interactive dashboards to create dynamic reports.
  • Use Spark and interact with the data simultaneously.

Enterprise security

Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.

  • Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure Databricks.
  • Azure Databricks role-based access enables fine-grained user permissions for notebooks, clusters, jobs, and data.
  • Enterprise-grade SLAs.

Integration with Azure services

Azure Databricks integrates deeply with Azure databases and stores: SQL Data Warehouse, Cosmos DB, Data Lake Store, and Blob Storage.

Integration with Power BI

Through rich integration with Power BI, Azure Databricks lets you discover and share impactful insights quickly and easily. You can also use other BI tools, such as Tableau Software, through JDBC/ODBC cluster endpoints.

Next steps