Connect your Azure Databricks workspace to your on-premises network

This high-level guide describes how to connect your Azure Databricks workspace to your on-premises network. It is based on the hub-and-spoke topology shown in the following diagram, in which traffic is routed through a transit virtual network (VNet) to the on-premises network.

This process requires your Azure Databricks workspace to be deployed in your own virtual network (also known as VNet injection).

[Diagram: Virtual network deployment]

Note

Don't hesitate to reach out to your Microsoft and Databricks account teams to discuss the configuration process described in this article.

Prerequisites

Your Azure Databricks workspace must be deployed in your own virtual network.

Step 1: Set up a transit virtual network with Azure Virtual Network Gateway

On-premises connectivity requires a Virtual Network Gateway (ExpressRoute or VPN) in a transit VNet. If one already exists, skip to Step 2.

Note

For assistance, contact your Microsoft account team.

Step 2: Peer the Azure Databricks virtual network with the transit virtual network

If your Azure Databricks workspace is not in the same VNet as the Virtual Network Gateway, follow the instructions in Peer virtual networks to peer the Azure Databricks VNet to the transit VNet, selecting the following options:

  • Use Remote Gateways on the Azure Databricks VNet side.
  • Allow Gateway Transit on the transit VNet side.

You can learn more about these options in Create a peering.

Note

If your on-premises network connection to Azure Databricks does not work with the above settings, you can also select the Allow Forwarded Traffic option on both sides of the peering to resolve the issue.

For more information about configuring VPN gateway transit for virtual network peering, see Configure VPN gateway transit for virtual network peering.
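The peering options in this step can be summarized as a small data sketch. The inner keys below are intended to mirror the property names of a virtual network peering resource; the outer keys (`databricks_to_transit`, `transit_to_databricks`) and the function name are hypothetical labels chosen for illustration, and the sketch does not call any Azure API.

```python
def peering_settings(allow_forwarded_traffic=False):
    """Return illustrative peering properties for each side of the peering.

    allow_forwarded_traffic corresponds to the fallback described in the note:
    enable it on both sides only if the default settings do not work.
    """
    return {
        # Azure Databricks (spoke) side: use the gateway that lives in the transit VNet.
        "databricks_to_transit": {
            "allowVirtualNetworkAccess": True,
            "useRemoteGateways": True,
            "allowGatewayTransit": False,
            "allowForwardedTraffic": allow_forwarded_traffic,
        },
        # Transit (hub) side: let the peered VNet use this VNet's gateway.
        "transit_to_databricks": {
            "allowVirtualNetworkAccess": True,
            "useRemoteGateways": False,
            "allowGatewayTransit": True,
            "allowForwardedTraffic": allow_forwarded_traffic,
        },
    }


if __name__ == "__main__":
    for side, props in peering_settings().items():
        print(side, props)
```

Keeping both sides in one structure makes it easier to see that the two gateway options are complementary: one side uses remote gateways and the other allows gateway transit.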

Step 3: Create user-defined routes and associate them with your Azure Databricks virtual network subnets

After the Azure Databricks VNet is peered with the transit VNet (using the virtual network gateway), Azure automatically configures all routes through the transit VNet. This can break cluster setup in the Azure Databricks workspace, because a properly configured return route from the cluster nodes to the Azure Databricks control plane might be missing. You must therefore create user-defined routes (also known as UDRs or custom routes).

  1. Create a route table using the instructions in Create a route table.

    When you create the route table, enable BGP route propagation.

    Note

    If your on-premises network connection setup fails during testing, you might need to disable the BGP route propagation option. Disable it only as a last resort.

  2. Add user-defined routes for the following services, using the instructions in Custom routes.

    Note

    Whether you use the control plane NAT IP or the SCC relay IP depends on whether secure cluster connectivity (SCC) is enabled for the workspace.

    Source   Address prefix                                  Next hop type
    -------  ----------------------------------------------  -------------
    Default  Control plane NAT IP (only if SCC is disabled)  Internet
    Default  SCC relay IP (only if SCC is enabled)           Internet
    Default  Webapp IP                                       Internet
    Default  Extended infrastructure IP                      Internet
    Default  Metastore IP                                    Internet
    Default  Artifact Blob storage IP                        Internet
    Default  Log Blob storage IP                             Internet
    Default  DBFS root Blob storage IP                       Internet

    To get the IP addresses for each of these services, follow the instructions in User-defined route settings for Azure Databricks.

  3. Associate the route table with your Azure Databricks VNet public and private subnets, using the instructions in Associate a route table to a subnet.

    After the custom route table is associated with your Azure Databricks VNet subnets, you do not need to edit the outbound security rules in the network security group. You could make the outbound rule for "Internet / Any" more specific, but there is no real reason to, because the routes control the actual egress.
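As a rough illustration of the route list in Step 3, the following sketch builds one route per service with next hop type Internet and picks the control plane NAT IP or the SCC relay IP depending on whether SCC is enabled. The function name, the dictionary keys, and the route names are hypothetical, and the addresses come from the 192.0.2.0/24 documentation range; substitute the real values from User-defined route settings for Azure Databricks.

```python
import ipaddress

# Services that need a route regardless of the SCC setting (keys are illustrative).
COMMON_SERVICES = [
    "webapp",
    "extended_infrastructure",
    "metastore",
    "artifact_blob_storage",
    "log_blob_storage",
    "dbfs_root_blob_storage",
]


def build_udrs(service_ips, scc_enabled):
    """Build a list of route definitions with next hop type 'Internet'."""
    control = "scc_relay" if scc_enabled else "control_plane_nat"
    routes = []
    for name in [control] + COMMON_SERVICES:
        # Accept a bare IP or a CIDR; a bare IP becomes a /32 prefix.
        prefix = ipaddress.ip_network(service_ips[name], strict=False)
        routes.append({
            "name": "to-" + name.replace("_", "-"),
            "address_prefix": str(prefix),
            "next_hop_type": "Internet",
        })
    return routes


# Placeholder addresses only (assumption); they are not real Azure Databricks IPs.
PLACEHOLDER_IPS = {
    "control_plane_nat": "192.0.2.1",
    "scc_relay": "192.0.2.2",
    "webapp": "192.0.2.3",
    "extended_infrastructure": "192.0.2.4",
    "metastore": "192.0.2.5",
    "artifact_blob_storage": "192.0.2.6",
    "log_blob_storage": "192.0.2.7",
    "dbfs_root_blob_storage": "192.0.2.8",
}

if __name__ == "__main__":
    for route in build_udrs(PLACEHOLDER_IPS, scc_enabled=True):
        print(route)
```

Generating the list from one mapping keeps the SCC-dependent choice in a single place, so the route table cannot end up with both the control plane NAT route and the SCC relay route by accident.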

Note

If the IP-based route fails during testing, you can create a service endpoint for Microsoft.Storage so that all Blob storage traffic goes through the Azure backbone. You can also take this approach instead of creating user-defined routes for Blob storage.

Note

If you want to access other PaaS Azure data services from Azure Databricks, such as CosmosDB or Azure Synapse Analytics, you must also add user-defined routes for those services to the route table. Resolve each endpoint to its IP address using nslookup or an equivalent command.
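One possible equivalent of nslookup is a short Python helper built on the standard library's `socket.getaddrinfo`. This is a minimal sketch; the function names are hypothetical, and the hostname in the usage comment is a made-up example rather than a real endpoint.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def to_route_prefixes(hostname):
    """Turn each resolved address into a /32 address prefix for a user-defined route."""
    return [ip + "/32" for ip in resolve_ipv4(hostname)]


# Usage (hypothetical endpoint name; run it from a network that can resolve it):
#   print(to_route_prefixes("myaccount.documents.azure.com"))
```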

Step 4: Validate the setup

To validate the setup:

  1. Create a cluster in your Azure Databricks workspace.

    If this fails, go through the instructions in Steps 1-3 again, trying the alternate configurations mentioned in the notes.

    If you still cannot create a cluster, check that the route table has all of the required user-defined routes. If you used service endpoints rather than user-defined routes for Blob storage, check that configuration as well.

    If that fails, reach out to your Microsoft and Databricks account teams for assistance.

  2. Once the cluster is created, try connecting to an on-premises VM by doing a simple ping from a notebook using %sh.
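If ICMP is filtered somewhere along the path, a TCP connection attempt from a Python notebook cell can serve as a complementary check. Note that this is a different technique from the ping described above, not a replacement for it. The sketch below assumes the on-premises VM listens on a known TCP port; the function name, address, and port in the usage comment are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


# Usage from a notebook cell (hypothetical on-premises address and port):
#   print(can_connect("10.1.0.4", 22))
```

A result of False only shows that the TCP handshake did not complete; it does not distinguish a routing problem from a firewall rule or a service that is not listening.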

The following guidance can also be helpful when troubleshooting:

Option: Route Azure Databricks traffic using a virtual appliance or firewall

You might also want to filter or vet all outgoing traffic from Azure Databricks cluster nodes using a firewall or DLP appliance (such as Azure Firewall, Palo Alto, or Barracuda). This may be required to:

  • Satisfy enterprise security policies that mandate all outgoing traffic to be "inspected" and allowed or denied as configured.
  • Get a single NAT-like public IP or CIDR for all Azure Databricks clusters, which could be configured in an allow list for any data source.

To set up filtering this way, follow the instructions in Steps 1-4, with some additional steps. What follows is for reference only; details may vary by firewall appliance:

  1. Set up a virtual appliance or firewall within the transit VNet, using the instructions in Create an NVA.

    Alternatively, you can create the firewall in a secure or DMZ subnet within the Azure Databricks VNet, separate from the existing private and public subnets. However, if you require a firewall setup for multiple workspaces, the transit VNet solution is recommended, in keeping with the hub-and-spoke topology.

  2. Create an additional route in the custom route table to 0.0.0.0/0 with Next hop type set to "Virtual Appliance" and an appropriate Next hop address.

    The routes configured in Step 3 should remain, although routes (or service endpoints) for Blob storage can be removed if all Blob storage traffic needs to be routed via the firewall.

    If you use the secure or DMZ subnet approach, you may need to create an additional route table that is associated with that subnet only. That route table should have a route to 0.0.0.0/0 with Next hop type set to "Internet" or "Virtual network gateway", depending on whether traffic is destined for a public network directly or via an on-premises network.

  3. Configure allow and deny rules in the firewall appliance.

    If you have removed the routes for Blob storage, add the corresponding Blob storage destinations to the allow list in the firewall.

    You might also need to add certain public repositories (such as those for Ubuntu) to the allow list, to ensure that clusters are created correctly.

    For information about allow lists, see User-defined route settings for Azure Databricks.
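The reason the Step 3 routes can coexist with the 0.0.0.0/0 route is that Azure picks the route with the longest matching prefix for each destination. The sketch below models only that rule: it ignores the tie-breaking between user-defined, BGP, and system routes, and the function name, the next hop address, and the destination IPs are all placeholders chosen for illustration.

```python
import ipaddress


def select_route(routes, destination):
    """Return the route whose address prefix is the longest match for the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        r for r in routes
        if dest in ipaddress.ip_network(r["address_prefix"])
    ]
    if not matches:
        return None
    return max(
        matches,
        key=lambda r: ipaddress.ip_network(r["address_prefix"]).prefixlen,
    )


# Hypothetical route table: one service route from Step 3 plus the firewall default route.
ROUTES = [
    {"address_prefix": "192.0.2.10/32", "next_hop_type": "Internet"},
    {"address_prefix": "0.0.0.0/0", "next_hop_type": "VirtualAppliance",
     "next_hop_ip": "10.0.100.4"},
]

if __name__ == "__main__":
    # Traffic to the service IP keeps its specific route.
    print(select_route(ROUTES, "192.0.2.10")["next_hop_type"])
    # Everything else falls through to the firewall.
    print(select_route(ROUTES, "198.51.100.7")["next_hop_type"])
```

In this model, removing the Blob storage routes simply means those destinations no longer have a more specific match, so their traffic follows the 0.0.0.0/0 route to the appliance.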

Option: Configure custom DNS

If you want to use your own DNS for name resolution, you can do so with Azure Databricks workspaces deployed in your own virtual network. See the following Microsoft documents for more information about how to configure custom DNS for an Azure virtual network: