What are SQL Server Big Data Clusters?

Applies to: SQL Server 2019 (15.x)

Starting with SQL Server 2019 (15.x), SQL Server Big Data Clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side so that you can read, write, and process big data from Transact-SQL or Spark, which lets you easily combine and analyze your high-value relational data with high-volume big data.

Use SQL Server Big Data Clusters to:

  • Deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes.
  • Read, write, and process big data from Transact-SQL or Spark.
  • Easily combine and analyze high-value relational data with high-volume big data.
  • Query external data sources.
  • Store big data in HDFS managed by SQL Server.
  • Query data from multiple external data sources through the cluster.
  • Use the data for AI, machine learning, and other analysis tasks.
  • Deploy and run applications in Big Data Clusters.
  • Virtualize data with PolyBase. Query data from external SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources with external tables.
  • Provide high availability for the SQL Server master instance and all databases by using Always On availability group technology.

For more information about new features and known issues in the latest release, see the release notes.

Scenarios

SQL Server Big Data Clusters provide flexibility in how you interact with your big data. You can query external data sources, store big data in HDFS managed by SQL Server, or query data from multiple external data sources through the cluster. You can then use the data for AI, machine learning, and other analysis tasks. The following sections provide more information about these scenarios.

Data virtualization

By using PolyBase, SQL Server Big Data Clusters can query external data sources without moving or copying the data. SQL Server 2019 (15.x) introduces new connectors to data sources.
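
As a rough illustration of what data virtualization looks like in practice, the sketch below builds the two Transact-SQL statements that PolyBase typically needs: an external data source and an external table that points at a remote object. It is a minimal sketch, not a definitive implementation. The helper function names, the host `remote-sql-host`, the credential `RemoteSqlCredential`, and the table and column names are all hypothetical, and the code assumes a database scoped credential has already been created. The helpers do no identifier quoting or escaping, so they are unsuitable for untrusted input.

```python
def external_data_source_sql(name, location, credential):
    """Build a CREATE EXTERNAL DATA SOURCE statement (illustrative helper)."""
    return (
        f"CREATE EXTERNAL DATA SOURCE {name}\n"
        f"WITH (LOCATION = '{location}', CREDENTIAL = {credential});"
    )


def external_table_sql(table, columns, remote_location, data_source):
    """Build a CREATE EXTERNAL TABLE statement over a remote object.

    columns is a list of (column_name, sql_type) pairs.
    """
    cols = ",\n    ".join(f"{col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n    {cols}\n)\n"
        f"WITH (LOCATION = '{remote_location}', DATA_SOURCE = {data_source});"
    )


if __name__ == "__main__":
    # Hypothetical names: replace with your own server, credential, and schema.
    print(external_data_source_sql(
        "RemoteSales", "sqlserver://remote-sql-host", "RemoteSqlCredential"))
    print(external_table_sql(
        "dbo.RemoteCustomers",
        [("CustomerId", "INT"), ("Name", "NVARCHAR(100)")],
        "SalesDb.dbo.Customers",
        "RemoteSales"))
```

Once statements like these have been run on the master instance, the external table can be queried like a local table while the data stays in the remote system.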

Data lake

A SQL Server big data cluster includes a scalable HDFS storage pool. You can use it to store big data, potentially ingested from multiple external sources. Once the big data is stored in HDFS in the big data cluster, you can analyze and query the data and combine it with your relational data.

Scale-out data mart

SQL Server Big Data Clusters provide scale-out compute and storage to improve the performance of analyzing any data. Data from a variety of sources can be ingested and distributed across data pool nodes as a cache for further analysis.
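
To make "distributed across data pool nodes" concrete, here is a toy sketch of spreading ingested rows evenly over a set of nodes. The text above does not say which distribution policy is used; round-robin is an assumption made here for illustration, and the function name and node names are hypothetical. The code models the idea only and does not talk to a real cluster.

```python
from itertools import cycle


def distribute_round_robin(rows, node_names):
    """Assign each row to the next node in turn and return the placement.

    Returns a dict that maps each node name to the list of rows it holds.
    """
    if not node_names:
        raise ValueError("at least one data pool node is required")
    placement = {name: [] for name in node_names}
    # cycle() repeats the node names endlessly; zip stops when rows run out.
    for row, name in zip(rows, cycle(node_names)):
        placement[name].append(row)
    return placement


if __name__ == "__main__":
    # Hypothetical node names and sample rows.
    print(distribute_round_robin(range(5), ["data-0", "data-1"]))
```

Because every node ends up with a roughly equal share of the rows, a query over the cached data can be split into similar-sized pieces of work per node.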

Integrated AI and machine learning

SQL Server Big Data Clusters enable AI and machine learning tasks on the data stored in HDFS storage pools and the data pools. You can use Spark as well as the built-in AI tools in SQL Server, using R, Python, Scala, or Java.

Management and monitoring

Management and monitoring are provided through a combination of command-line tools, APIs, portals, and dynamic management views.

You can use Azure Data Studio to perform a variety of tasks on the big data cluster:

  • Built-in snippets for common management tasks.
  • Ability to browse HDFS, upload files, preview files, and create directories.
  • Ability to create, open, and run Jupyter-compatible notebooks.
  • Data virtualization wizard to simplify the creation of external data sources (enabled by the Data Virtualization Extension).

Architecture

A SQL Server big data cluster is a cluster of Linux containers orchestrated by Kubernetes.

Kubernetes concepts

Kubernetes is an open source container orchestrator that can scale container deployments according to need. The following table defines some important Kubernetes terminology:

Term | Description
Cluster | A Kubernetes cluster is a set of machines, known as nodes. One node controls the cluster and is designated the master node; the remaining nodes are worker nodes. The Kubernetes master is responsible for distributing work between the workers, and for monitoring the health of the cluster.
Node | A node runs containerized applications. It can be either a physical machine or a virtual machine. A Kubernetes cluster can contain a mixture of physical machine and virtual machine nodes.
Pod | A pod is the atomic deployment unit of Kubernetes. A pod is a logical group of one or more containers, and associated resources, needed to run an application. Each pod runs on a node; a node can run one or more pods. The Kubernetes master automatically assigns pods to nodes in the cluster.
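
The relationship between the three terms can be sketched as a small data model. This is a toy illustration of the definitions in the table, not how Kubernetes is implemented. The class names, the `assign` method, and the "fewest pods first" placement policy are assumptions made for the example; the real Kubernetes scheduler weighs many more factors.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """A logical group of one or more containers."""
    name: str
    containers: list


@dataclass
class Node:
    """A physical or virtual machine that runs pods."""
    name: str
    pods: list = field(default_factory=list)


@dataclass
class Cluster:
    """A set of worker nodes; assign() stands in for the master's role."""
    workers: list

    def assign(self, pod):
        # Toy policy: pick the worker that currently runs the fewest pods.
        # On a tie, min() returns the first such worker in the list.
        target = min(self.workers, key=lambda node: len(node.pods))
        target.pods.append(pod)
        return target


if __name__ == "__main__":
    cluster = Cluster(workers=[Node("node-a"), Node("node-b")])
    for i in range(3):
        placed_on = cluster.assign(Pod(f"pod-{i}", ["container"]))
        print(f"pod-{i} -> {placed_on.name}")
```

Each pod lands on exactly one node, while a node can accumulate several pods, which mirrors the "each pod runs on a node; a node can run one or more pods" rule in the table.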
 

In SQL Server Big Data Clusters, Kubernetes is responsible for the state of the cluster: it builds and configures the cluster nodes, assigns pods to nodes, and monitors the health of the cluster.

Big data cluster architecture

The following diagram shows the components of a SQL Server big data cluster:

[Diagram: architecture overview]

Controller

The controller provides management and security for the cluster. It contains the control service, the configuration store, and other cluster-level services such as Kibana, Grafana, and Elasticsearch.

Compute pool

The compute pool provides computational resources to the cluster. It contains nodes running SQL Server on Linux pods. The pods in the compute pool are divided into SQL Compute instances for specific processing tasks.

Data pool

The data pool is used for data persistence and caching. It consists of one or more pods running SQL Server on Linux, and it is used to ingest data from SQL queries or Spark jobs. SQL Server big data cluster data marts are persisted in the data pool.

Storage pool

The storage pool consists of storage pool pods, each made up of SQL Server on Linux, Spark, and HDFS. All the storage nodes in a SQL Server big data cluster are members of an HDFS cluster.

Tip

For an in-depth look into big data cluster architecture and installation, see Workshop: Microsoft SQL Server Big Data Clusters Architecture.

Next steps

For more information about deploying SQL Server Big Data Clusters, see Get started with SQL Server Big Data Clusters.