What is ML Services in Azure HDInsight


In September 2017, Microsoft R Server was released under the new name Microsoft Machine Learning Server, or ML Server. Consequently, the R Server cluster on HDInsight is now called the Machine Learning Services, or ML Services, cluster on HDInsight. For more information on the R Server name change, see Microsoft R Server is now Microsoft Machine Learning Server.

Microsoft Machine Learning Server is available as a deployment option when you create HDInsight clusters in Azure. The cluster type that provides this option is called ML Services. This capability provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.

ML Services on HDInsight provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded to either Azure Blob or Data Lake storage. Because the ML Services cluster is built on open-source R, the R-based applications you build can use any of the 8000+ open-source R packages. The routines in ScaleR, Microsoft's big data analytics package, are also available.

The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. With an edge node, you have the option of running the parallelized distributed functions of ScaleR across the cores of the edge node server. You can also run them across the nodes of the cluster by using ScaleR's Hadoop MapReduce or Apache Spark compute contexts.

The models or predictions that result from analysis can be downloaded for on-premises use. They can also be operationalized elsewhere in Azure, in particular through an Azure Machine Learning Studio web service.

Get started with ML Services on HDInsight

To create an ML Services cluster in Azure HDInsight, select the ML Services cluster type when creating an HDInsight cluster using the Azure portal. The ML Services cluster type includes ML Server on the data nodes of the cluster and on an edge node, which serves as a landing zone for ML Services-based analytics. See Getting Started with ML Services on HDInsight for a walkthrough on how to create the cluster.

Why choose ML Services in HDInsight?

ML Services in HDInsight provides the following benefits:

AI innovation from Microsoft and open-source

ML Services includes a highly scalable, distributed set of algorithms, such as RevoScaleR, revoscalepy, and MicrosoftML, that can work on data larger than physical memory and run on a wide variety of platforms in a distributed manner. Learn more about the collection of Microsoft's custom R packages and Python packages included with the product.

ML Services brings together these Microsoft innovations and the contributions coming from the open-source community (R, Python, and AI toolkits) on a single enterprise-grade platform. Any R or Python open-source machine learning package can work side by side with any proprietary innovation from Microsoft.

Simple, secure, and high-scale operationalization and administration

Enterprises that rely on traditional paradigms and environments invest much time and effort in operationalization. This results in inflated costs and delays, including the translation time for models, iterations to keep them valid and current, regulatory approval, and managing permissions through operationalization.

ML Services offers enterprise-grade operationalization: after a machine learning model is completed, it takes just a few clicks to generate web service APIs. These web services are hosted on a server grid in the cloud and can be integrated with line-of-business applications. The ability to deploy to an elastic grid lets you scale seamlessly with the needs of your business, both for batch and real-time scoring. For instructions, see Operationalize ML Services on HDInsight.

Key features of ML Services on HDInsight

The following features are included in ML Services on HDInsight.

Feature category: Description
R-enabled: R packages for solutions written in R, with an open-source distribution of R, and run-time infrastructure for script execution.
Python-enabled: Python modules for solutions written in Python, with an open-source distribution of Python, and run-time infrastructure for script execution.
Pre-trained models: For visual analysis and text sentiment analysis, ready to score data you provide.
Deploy and consume: Operationalize your server and deploy solutions as a web service.
Remote execution: Start remote sessions on an ML Services cluster on your network from your client workstation.

Data storage options for ML Services on HDInsight

Default storage for the HDFS file system of HDInsight clusters can be associated with either an Azure Storage account or Azure Data Lake Storage. This association ensures that whatever data is uploaded to the cluster storage during analysis is made persistent, and that the data remains available even after the cluster is deleted. There are various tools for handling the data transfer to the storage option that you select, including the portal-based upload facility of the storage account and the AzCopy utility.

You have the option of enabling access to additional Blob and Data Lake stores during the cluster provisioning process, regardless of the primary storage option in use. See Getting started with ML Services on HDInsight for information on adding access to additional accounts. See the Azure Storage options for ML Services on HDInsight article to learn more about using multiple storage accounts.

You can also use Azure Files as a storage option for use on the edge node. Azure Files enables you to mount a file share that was created in Azure Storage to the Linux file system. For more information about these data storage options for an ML Services cluster on HDInsight, see Azure Storage options for ML Services on HDInsight.

Access ML Services edge node

You can connect to Microsoft ML Server on the edge node using a browser. It is installed by default during cluster creation. For more information, see Get started with ML Services on HDInsight. You can also connect to the cluster edge node from the command line by using SSH/PuTTY to access the R console.

Develop and run R scripts

The R scripts you create and run can use any of the 8000+ open-source R packages in addition to the parallelized and distributed routines available in the ScaleR library. In general, a script that is run with ML Services on the edge node runs within the R interpreter on that node. The exceptions are those steps that call a ScaleR function with a compute context that is set to Hadoop MapReduce (RxHadoopMR) or Spark (RxSpark). In this case, the function runs in a distributed fashion across those data (task) nodes of the cluster that are associated with the data referenced. For more information about the different compute context options, see Compute context options for ML Services on HDInsight.

Operationalize a model

When your data modeling is complete, you can operationalize the model to make predictions for new data, either in Azure or on-premises. This process is known as scoring. Scoring can be done in HDInsight, Azure Machine Learning, or on-premises.

Score in HDInsight

To score in HDInsight, write an R function that calls your model to make predictions for a new data file that you've loaded to your storage account. Then, save the predictions back to the storage account. You can run this routine on demand on the edge node of your cluster or by using a scheduled job.
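A minimal sketch of such a scoring routine follows, written in base R. The function name `score_file`, the linear model on the built-in `mtcars` data set, and the temporary files are all stand-ins chosen for illustration; on a cluster the paths would point at files in your storage account and the model would be the one you trained.

```r
# Hypothetical helper: read a new data file, add predictions from the
# model, and write the result to an output file.
score_file <- function(model, input_path, output_path) {
  new_data <- read.csv(input_path)
  new_data$prediction <- predict(model, newdata = new_data)
  write.csv(new_data, output_path, row.names = FALSE)
  invisible(output_path)
}

# Stand-in model: a linear model fitted on the built-in mtcars data set.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Temporary files stand in for locations in the cluster's storage.
input_path <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(mtcars[1:5, c("wt", "hp")], input_path, row.names = FALSE)

# Score the new file and read the saved predictions back.
score_file(model, input_path, output_path)
scored <- read.csv(output_path)
```

Wrapping the steps in one function makes it straightforward to call the same routine either interactively or from a scheduled job.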

Score in Azure Machine Learning (AML)

To score using Azure Machine Learning, use the open-source Azure Machine Learning R package known as AzureML to publish your model as an Azure web service. For convenience, this package is pre-installed on the edge node. Next, use the facilities in Azure Machine Learning to create a user interface for the web service, and then call the web service as needed for scoring.

If you choose this option, you must convert any ScaleR model objects to equivalent open-source model objects for use with the web service. Use ScaleR coercion functions, such as as.randomForest() for ensemble-based models, for this conversion.

Score on-premises

To score on-premises after creating your model, you can serialize the model in R, download it, de-serialize it, and then use it for scoring new data. You can score new data by using the approach described earlier in Score in HDInsight, or by using web services.
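The serialize, download, de-serialize, and score sequence can be sketched in base R with saveRDS() and readRDS(). The model (a linear model on the built-in `mtcars` data set), the temporary file path, and the new data values are assumptions made for illustration, and the download step is only indicated by a comment.

```r
# On the cluster: fit a stand-in model and serialize it to a file.
model <- lm(mpg ~ wt + hp, data = mtcars)
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# ... download the .rds file to the on-premises machine ...

# On-premises: de-serialize the model and score new data with it.
restored <- readRDS(model_path)
new_data <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))
predictions <- predict(restored, newdata = new_data)
```

The on-premises R session needs the packages that define the model's class available in order to call predict() on the restored object.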

Maintain the cluster

Install and maintain R packages

Most of the R packages that you use are required on the edge node, since most steps of your R scripts run there. To install additional R packages on the edge node, you can use the install.packages() function in R.
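One way to keep edge node installs repeatable is to install only what is missing. The helper names below are hypothetical, and the CRAN mirror URL is an assumption that you may need to replace with a repository reachable from your cluster.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the missing packages.
# The repository URL is an assumed default, not a cluster requirement.
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

# "stats" and "utils" ship with R, so nothing is downloaded here.
already_there <- install_if_missing(c("stats", "utils"))
```

Because the function returns the names it had to install, a setup script can log exactly what changed on the edge node.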

If you are just using routines from the ScaleR library across the cluster, you do not usually need to install additional R packages on the data nodes. However, you might need additional packages to support the use of rxExec or rxDataStep execution on the data nodes.

In such cases, the additional packages can be installed with a script action after you create the cluster. For more information, see Manage ML Services in HDInsight cluster.

Change Apache Hadoop MapReduce memory settings

A cluster can be modified to change the amount of memory that is available to ML Services when it is running a MapReduce job. To modify a cluster, use the Apache Ambari UI that's available through the Azure portal blade for your cluster. For instructions about how to access the Ambari UI for your cluster, see Manage HDInsight clusters using the Ambari Web UI.

It is also possible to change the amount of memory that is available to ML Services by using Hadoop switches in the call to RxHadoopMR, as follows:

hadoopSwitches = "-libjars /etc/hadoop/conf -Dmapred.job.map.memory.mb=6656"  
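To try different memory settings without retyping the switch string, it can be assembled from a number. The helper name `make_hadoop_switches` is hypothetical; the switch names and the 6656 MB value (6.5 GB) are taken from the line above.

```r
# Hypothetical helper: build the hadoopSwitches string for a given
# amount of map memory in megabytes.
make_hadoop_switches <- function(map_memory_mb, lib_jars = "/etc/hadoop/conf") {
  stopifnot(is.numeric(map_memory_mb), map_memory_mb > 0)
  sprintf("-libjars %s -Dmapred.job.map.memory.mb=%d",
          lib_jars, as.integer(map_memory_mb))
}

hadoopSwitches <- make_hadoop_switches(6656)

# On the cluster, with RevoScaleR loaded, the string would be passed as:
#   rxSetComputeContext(RxHadoopMR(hadoopSwitches = hadoopSwitches))
```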

Scale your cluster

An existing ML Services cluster on HDInsight can be scaled up or down through the portal. Scaling up gives you the additional capacity that you might need for larger processing tasks, and you can scale the cluster back down when it is idle. For instructions about how to scale a cluster, see Manage HDInsight clusters.

Maintain the system

Maintenance to apply OS patches and other updates is performed on the underlying Linux VMs in an HDInsight cluster during off-hours. Typically, maintenance is done at 3:30 AM (based on the local time for the VM) every Monday and Thursday. Updates are performed in such a way that they don't impact more than a quarter of the cluster at a time.

Because the head nodes are redundant and not all data nodes are impacted at once, jobs that are running during this time might slow down but should still run to completion. Any custom software or local data that you have is preserved across these maintenance events unless a catastrophic failure occurs that requires a cluster rebuild.

IDE options for ML Services on HDInsight

The Linux edge node of an HDInsight cluster is the landing zone for R-based analysis. Recent versions of HDInsight provide a default installation of RStudio Server on the edge node as a browser-based IDE. Using RStudio Server as an IDE for developing and running R scripts can be considerably more productive than using the R console alone.

Additionally, you can install a desktop IDE and use it to access the cluster through a remote MapReduce or Spark compute context. Options include Microsoft's R Tools for Visual Studio (RTVS), RStudio, and Walware's Eclipse-based StatET.

You can also access the R console on the edge node by typing R at the Linux command prompt after connecting via SSH or PuTTY. When using the console interface, it is convenient to run a text editor for R script development in another window, and to cut and paste sections of your script into the R console as needed.


The prices that are associated with an ML Services HDInsight cluster are structured similarly to the prices for other HDInsight cluster types. They are based on the sizing of the underlying VMs across the name, data, and edge nodes, with the addition of a core-hour uplift. For more information, see HDInsight pricing.

Next steps

To learn more about how to use ML Services on HDInsight clusters, see the following topics: