R developer's guide to Azure

R logo

Many data scientists dealing with ever-increasing volumes of data are looking for ways to harness the power of cloud computing for their analyses. This article provides an overview of the ways data scientists can apply their existing R programming skills in Azure.

Microsoft has fully embraced the R programming language as a first-class tool for data scientists. By providing many different options for R developers to run their code in Azure, the company enables data scientists to extend their data science workloads into the cloud when tackling large-scale projects.

Let's examine the various options and the most compelling scenarios for each one.

Azure services with R language support

This article covers the following Azure services that support the R language:

Service | Description
Data Science Virtual Machine | A customized VM to use as a data science workstation or as a custom compute target
ML Services on HDInsight | Cluster-based system for running R analyses on large datasets across many nodes
Azure Databricks | Collaborative Spark environment that supports R and other languages
Azure Machine Learning | Cloud service that you use to train, deploy, automate, and manage machine learning models
Machine Learning Studio (classic) | Run custom R scripts in Azure's machine learning experiments
Azure Batch | Offers a variety of options for economically running R code across many nodes in a cluster
Azure Notebooks | A no-cost, cloud-based version of Jupyter notebooks
Azure SQL Database | Run R scripts inside the SQL Server database engine

Data Science Virtual Machine

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft's Azure cloud platform built specifically for doing data science. It includes many popular data science tools.

The DSVM can be provisioned with either Windows or Linux as the operating system. You can use the DSVM in two different ways: as an interactive workstation or as a compute platform for a custom cluster.

As a workstation

If you want to get started with R in the cloud quickly and easily, this is your best bet. The environment will be familiar to anyone who has worked with R on a local workstation. However, instead of using local resources, the R environment runs on a VM in the cloud. If your data is already stored in Azure, this has the added benefit of allowing your R scripts to run "closer to the data." Instead of being transferred across the Internet, the data can be accessed over Azure's internal network, which provides much faster access times.

The DSVM can be particularly useful to small teams of R developers. Instead of investing in a powerful workstation for each developer and requiring team members to agree on which versions of the various software packages to use, each developer can spin up an instance of the DSVM whenever needed.

As a compute platform

In addition to serving as a workstation, the DSVM can also be used as an elastically scalable compute platform for R projects. Using the AzureDSVM R package, you can programmatically control the creation and deletion of DSVM instances. You can form the instances into a cluster and deploy a distributed analysis to be performed in the cloud. This entire process can be controlled by R code running on your local workstation.

To learn more about the DSVM, see Introduction to Azure Data Science Virtual Machine for Linux and Windows.

ML Services on HDInsight

Microsoft ML Services provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. This solution provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded to either Azure Blob or Data Lake storage.

This is an enterprise-grade solution that allows you to scale your R code across a cluster. By leveraging functions in Microsoft's RevoScaleR package, your R scripts on HDInsight can run data processing functions in parallel across many nodes in a cluster. This allows R to crunch data on a much larger scale than is possible with single-threaded R running on a workstation.

This ability to scale makes ML Services on HDInsight a great option for R developers with massive data sets. It provides a flexible and scalable platform for running your R scripts in the cloud.

For a walk-through on creating an ML Services cluster, see Get started with ML Services on Azure HDInsight.

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Collaboration in Databricks is enabled by the platform's notebook system. Users can create and edit notebooks and share them with other users of the system. These notebooks allow users to write code that executes against Spark clusters managed in the Databricks environment. The notebooks fully support R and give users access to Spark through both the SparkR and sparklyr packages.

Since Databricks is built on Spark and has a strong focus on collaboration, the platform is often used by teams of data scientists that work together on complex analyses of large data sets. Because the notebooks in Databricks support other languages in addition to R, the platform is especially useful for teams where analysts use different languages for their primary work.

The article What is Azure Databricks? provides more details about the platform and can help you get started.

Azure Machine Learning

Azure Machine Learning can be used for any kind of machine learning, from classical machine learning to deep learning, both supervised and unsupervised. Whether you prefer to write Python or R code or to use zero-code/low-code options such as the designer, you can build, train, and track highly accurate machine learning and deep-learning models in an Azure Machine Learning workspace.

Start training on your local machine and then scale out to the cloud. Train your first model in R with Azure Machine Learning today.

Azure Machine Learning Studio (classic)

Machine Learning Studio (classic) is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions in the cloud. It enables emerging data scientists to create and deploy machine learning models without the need to write much code.

Azure Machine Learning Studio (classic) supports both R and Python.

Customers currently using or evaluating Machine Learning Studio (classic) are encouraged to try Azure Machine Learning designer (preview), which provides drag-and-drop ML modules plus scalability, version control, and enterprise security.

Azure Batch

For large-scale R jobs, you can use Azure Batch. This service provides cloud-scale job scheduling and compute management so you can scale your R workload across tens, hundreds, or thousands of virtual machines. Since it is a generalized computing platform, there are a few options for running R jobs on Azure Batch.

One option is to use Microsoft's doAzureParallel package. This R package is a parallel backend for the foreach package. It allows each iteration of the foreach loop to run in parallel on a node within the Azure Batch cluster. For an introduction to the package, see the blog post doAzureParallel: Take advantage of Azure's flexible compute directly from your R session.
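
As a rough illustration of the foreach pattern that such a backend parallelizes, here is a minimal sketch. It registers foreach's built-in sequential backend (registerDoSEQ) as a local stand-in so that it runs without an Azure account. With doAzureParallel, the registration step would instead point at a Batch cluster. The function estimate_pi and its workload are hypothetical and are not taken from the package.

```r
library(foreach)

# Local stand-in backend so the sketch runs without Azure credentials.
# Assumption: with doAzureParallel, this line would be replaced by that
# package's own backend registration against a Batch cluster.
registerDoSEQ()

# Hypothetical workload: Monte Carlo estimate of pi from n random points.
estimate_pi <- function(n, seed) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Each iteration is independent of the others, which is what lets a
# parallel backend hand every iteration to a different node.
estimates <- foreach(seed = 1:8, .combine = c) %dopar% {
  estimate_pi(10000, seed)
}

# Combine the per-iteration results on the local machine.
pi_hat <- mean(estimates)
print(pi_hat)
```

The loop body does not change when the backend changes, so the same code can be tried locally first and then pointed at a cluster.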

Another option for running an R script in Azure Batch is to bundle your code with "RScript.exe" as a Batch App in the Azure portal. For a detailed walk-through, consult R Workloads on Azure Batch.

A third option is to use the Azure Distributed Data Engineering Toolkit (AZTK), which allows you to provision on-demand Spark clusters using Docker containers in Azure Batch. This provides an economical way to run Spark jobs in Azure. By using sparklyr with AZTK, your R scripts can be scaled out in the cloud easily and economically.

Azure Notebooks

Azure Notebooks is a low-cost, low-friction method for R developers who prefer working with notebooks to bring their code to Azure. It is a free service for anyone to develop and run code in their browser using Jupyter, an open-source project that enables combining markdown prose, executable code, and graphics on a single canvas.

The free service tier of Azure Notebooks is a viable option for small-scale projects only, because it limits each notebook's process to 4 GB of memory and data sets to 1 GB. If you need compute and data capacity beyond these limits, you can run notebooks in a Data Science Virtual Machine instance. For more information, see Manage and configure Azure Notebooks projects - Compute tier.

Azure SQL Database

Azure SQL Database is Microsoft's intelligent, fully managed relational cloud database service. It allows you to use the full power of SQL Server without the hassle of setting up the infrastructure. This includes Machine Learning Services in SQL Server, which is one of the more recent additions to SQL Server.

This feature offers an embedded predictive analytics and data science engine that can execute R code within a SQL Server database as stored procedures, as T-SQL scripts containing R statements, or as R code containing T-SQL. Instead of extracting data from the database and loading it into the R environment, you load your R code directly into the database and let it run right alongside the data.
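
To make the "code runs alongside the data" idea more concrete, the following is a minimal sketch of what the R portion of such a script might look like. It assumes the default variable names used by SQL Server's external script mechanism, where the input query result arrives as a data frame called InputDataSet and the result is returned through OutputDataSet. The table contents below are hypothetical and are built by hand so the sketch runs outside a database.

```r
# Stand-in for the rows the database engine would pass to the script.
# Assumption: inside the database this data frame is supplied by the
# input query under the default name InputDataSet; here it is made up.
InputDataSet <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  sales  = c(100, 150, 80, 120, 100)
)

# The script body: aggregate next to the data instead of exporting it.
# Assumption: the engine reads the result back from OutputDataSet.
OutputDataSet <- aggregate(sales ~ region, data = InputDataSet, FUN = mean)

print(OutputDataSet)
```

Only the aggregated rows leave the script, which is the practical benefit of running R next to the data rather than pulling the full table into a separate R session.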

While Machine Learning Services has been part of on-premises SQL Server since 2016, it is relatively new to Azure SQL Database. It is currently in limited preview but will continue to evolve.

The R logo is © 2016 The R Foundation and is used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license.