Install Machine Learning Server for Hadoop
Applies to: Machine Learning Server 9.2.1 | 9.3 | 9.4
On a Spark cluster, Machine Learning Server must be installed on the edge node and on all data nodes of a commercial distribution of Hadoop: Cloudera, Hortonworks, or MapR. Operationalization features are optional and can be installed on edge nodes only.
Machine Learning Server is engineered for the following architecture:
- Hadoop Distributed File System (HDFS)
- Apache YARN
- MapReduce or Spark 2.0-2.1 (Machine Learning Server 9.2.1 and 9.3) or Spark 2.4 (Machine Learning Server 9.4)
We recommend Spark for the processing framework.
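The version pairing above can be checked before setup begins. The following is a minimal sketch, not part of the product: the function name `spark_supported` is hypothetical, and it assumes that patch releases (for example 2.1.1) count as part of the 2.0-2.1 range.

```shell
# Returns success when the Spark version matches the pairing listed above.
# Arguments: Machine Learning Server version, Spark version (major.minor[.patch]).
# Assumption: patch releases are treated as part of their major.minor line.
spark_supported() {
  local mls="$1" spark_minor
  spark_minor="$(printf '%s' "$2" | cut -d. -f1,2)"   # keep major.minor only
  case "$mls" in
    9.2.1|9.3) [ "$spark_minor" = "2.0" ] || [ "$spark_minor" = "2.1" ] ;;
    9.4)       [ "$spark_minor" = "2.4" ] ;;
    *)         return 2 ;;   # server version not covered by this article
  esac
}
```

For example, `spark_supported 9.4 2.4.0 && echo "supported pairing"` prints the message only when the pairing is listed.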
These instructions use package managers to connect to Microsoft sites, download the distributions, and install the server. If you prefer working with gzip files on a local machine, you can download en_machine_learning_server_9.2.1_for_hadoop_x64_100353069.gz from Visual Studio Dev Essentials.
System and setup requirements
The operating system must be 64-bit Linux running a supported version of Hadoop.
Minimum RAM is 8 GB (16 GB or more is recommended). Minimum disk space is 500 MB per node.
An internet connection. If you do not have an internet connection, use the offline installation instructions.
Root or super user permissions
Installation is through package managers. Unlike previous releases, there is no install.sh script.
- rpm: RHEL, CentOS, SUSE
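The requirements above lend themselves to a quick preflight check on each node. This is a sketch under stated assumptions: the thresholds come from this article, while the function names and the `/opt` default target are illustrative only. Note that `/proc/meminfo` reports slightly less than the installed RAM, so a node with exactly 8 GB may fail the memory check.

```shell
# Preflight sketch. Thresholds come from the requirements above;
# function names and the /opt default are illustrative assumptions.
MIN_RAM_KB=$((8 * 1024 * 1024))   # 8 GB in kB, the unit used by /proc/meminfo
MIN_DISK_KB=$((500 * 1024))       # 500 MB in kB, the unit used by df -k

# Succeeds when the first number is at least the second.
at_least() { [ "$1" -ge "$2" ]; }

# Total physical memory in kB (Linux only).
total_ram_kb() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }

# Available space in kB on the file system that holds the given path.
free_disk_kb() { df -Pk "$1" | awk 'NR==2 {print $4}'; }

# Prints one line per unmet requirement and returns non-zero if any failed.
preflight() {
  local target="${1:-/opt}" status=0
  [ "$(uname -m)" = "x86_64" ] || { echo "64-bit Linux required"; status=1; }
  [ "$(id -u)" -eq 0 ] || { echo "root or super user permissions required"; status=1; }
  at_least "$(total_ram_kb)" "$MIN_RAM_KB" || { echo "less than 8 GB RAM"; status=1; }
  at_least "$(free_disk_kb "$target")" "$MIN_DISK_KB" || { echo "less than 500 MB free on $target"; status=1; }
  return "$status"
}
```

Running `preflight /opt` on a node prints any unmet requirement; no output means all checks passed.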
Running setup on existing installations
The installation path for Machine Learning Server is new: /opt/microsoft/mlserver/9.4.7. However, if R Server 9.x is present, Machine Learning Server 9.x finds R Server at the old path (/usr/lib64/microsoft-r/9.1.0) and replaces it with the new version.
There is no support for side-by-side installations of older and newer versions, nor is there support for hybrid versions (such as R Server 9.1 and Machine Learning Server 9.4). An installation is either entirely 9.4 or an earlier version.
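Because setup replaces an earlier version rather than installing beside it, it can help to see what is already on a node first. The sketch below assumes that each version lives in its own subdirectory of the two parent paths mentioned above; the function names are hypothetical.

```shell
# Prints the version directories found under an install parent, one per line.
list_versions() {
  local parent="$1" dir
  [ -d "$parent" ] || return 0
  for dir in "$parent"/*/; do
    [ -d "$dir" ] && basename "$dir"
  done
  return 0
}

# Reports which product versions are present. The default parents are the
# old and new paths named in this article; pass others as arguments.
report_existing() {
  local old_parent="${1:-/usr/lib64/microsoft-r}" new_parent="${2:-/opt/microsoft/mlserver}"
  local old new
  old="$(list_versions "$old_parent" | tr '\n' ' ')"
  new="$(list_versions "$new_parent" | tr '\n' ' ')"
  [ -n "$old" ] && echo "R Server found (setup replaces it): $old"
  [ -n "$new" ] && echo "Machine Learning Server found: $new"
  [ -z "$old$new" ] && echo "No existing installation detected"
  return 0
}
```

Calling `report_existing` with no arguments inspects the default locations on the current node.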
After installation completes, software can be found at the following paths:
- Install root:
- Microsoft R Open root:
- Executables such as Revo64 and mlserver-python are at
1 - Edge node installation
Start here. Machine Learning Server is required on the edge node. You should run full setup, following the installation commands for the Linux operating system used by your cluster: Linux install > How to install.
Full setup gives you core components for both R and Python, machine learning algorithms and pretrained models, and operationalization. Operationalization features run on edge nodes, enabling additional ways of deploying and consuming scripts. For example, you can build and deploy web services, which allow you to invoke and access your solution programmatically through a REST API.
You cannot use operationalization on data nodes. Operationalization does not support YARN queues and cannot run in a distributed manner.
2 - Data node installation
You can continue installation by running Setup on any data node, either sequentially or on multiple data nodes concurrently. There are two approaches for installing Machine Learning Server on data nodes.
Approach 1: Package managers for full installation
Again, we recommend running the full setup on every node. This approach is fast because package managers do most of the work, including adding the Hadoop package (microsoft-mlserver-hadoop-9.4.7) and setting it up for activation.
As before, follow the installation steps for the Linux operating system used by your cluster: Linux install > How to install.
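Since setup can run on several data nodes concurrently, the same command can be fanned out from one machine. The following is a sketch only: `run_on_nodes`, the `REMOTE_SHELL` variable, and the node list file are hypothetical, passwordless `ssh` access to each node is assumed, and the setup command itself is whatever the Linux install instructions give for your platform.

```shell
# Hypothetical helper: runs one command on every host listed in a file,
# in parallel, and returns non-zero if any host reported a failure.
# REMOTE_SHELL defaults to ssh; override it for testing or other transports.
REMOTE_SHELL="${REMOTE_SHELL:-ssh}"

run_on_nodes() {
  local node_file="$1" node pid pids="" failed=0
  shift
  while IFS= read -r node || [ -n "$node" ]; do
    [ -z "$node" ] && continue                 # skip blank lines
    $REMOTE_SHELL "$node" "$@" < /dev/null &   # do not let the remote shell eat the node list
    pids="$pids $!"
  done < "$node_file"
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}
```

A call such as `run_on_nodes datanodes.txt "<setup command for your platform>"` would start setup on every listed node and wait for all of them to finish.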
Approach 2: Manual steps for partial installation
Alternatively, you can install a subset of packages. You might do this if you do not want operationalization on your data nodes, or if you want to exclude a specific language. Be prepared for more testing if you choose this approach. The packages are not specifically designed to run as standalone modules. Hence, unexpected problems are more likely if you leave some packages out.
Install as root:
Refer to the annotated package list and download individual packages from the package repo corresponding to your platform:
Make a directory to contain your packages:
hadoop fs -mkdir /tmp/mlsdatanode
Copy the packages:
hadoop fs -copyFromLocal /tmp/mlserver /tmp/mlsdatanode
Switch to the directory:
Install the packages using the tool and syntax for your platform:
- On Ubuntu online:
apt-get install ./*.deb
- On Ubuntu offline:
dpkg -i *.deb
- On CentOS and RHEL:
yum install *.rpm
Activate the server:
Repeat this procedure on remaining nodes.
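When the same procedure is scripted across mixed platforms, the choice of install tool can be made from the platform name. This sketch assumes the platform is identified by the `ID` field of `/etc/os-release` (`ubuntu`, `centos`, `rhel`); the function name is hypothetical. On Ubuntu, `apt-get` works with `.deb` files, and the `./` prefix tells it to treat the arguments as local files rather than repository package names.

```shell
# Echoes the install command for a platform ID and a mode (online or offline).
# Platform IDs follow the ID field of /etc/os-release (an assumption of this sketch).
install_command() {
  local platform="$1" mode="${2:-online}"
  case "$platform:$mode" in
    ubuntu:online)   echo "apt-get install ./*.deb" ;;
    ubuntu:offline)  echo "dpkg -i *.deb" ;;
    centos:*|rhel:*) echo "yum install *.rpm" ;;
    *) echo "unsupported platform: $platform" >&2; return 1 ;;
  esac
}
```

For example, after `. /etc/os-release`, the call `install_command "$ID"` prints the command to run from the package directory.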
The following packages comprise a full Machine Learning Server installation:
microsoft-mlserver-packages-r-9.4.7 ** core
microsoft-mlserver-python-9.4.7 ** core
microsoft-mlserver-packages-py-9.4.7 ** core
microsoft-mlserver-hadoop-9.4.7 ** hadoop (required for hadoop)
microsoft-mlserver-mml-r-9.4.7 ** microsoftml for R (optional)
microsoft-mlserver-mml-py-9.4.7 ** microsoftml for Python (optional)
microsoft-mlserver-mlm-r-9.4.7 ** pre-trained models (requires mml)
microsoft-mlserver-mlm-py-9.4.7 ** pre-trained models (requires mml)
microsoft-mlserver-adminutil-9.4.7 ** operationalization (optional)
microsoft-mlserver-computenode-9.4.7 ** operationalization (optional)
microsoft-mlserver-config-rserve-9.4.7 ** operationalization (optional)
microsoft-mlserver-dotnet-9.4.7 ** operationalization (optional)
microsoft-mlserver-webnode-9.4.7 ** operationalization (optional)
azure-cli-2.0.25-1.el7.x86_64 ** operationalization (optional)
The microsoft-mlserver-python-9.4.7 package provides Miniconda 4.5.12 with Python 3.7.1, executing as mlserver-python, found in
Microsoft R Open is required for R execution:
microsoft-r-open-foreachiterators-3.5.2
microsoft-r-open-mkl-3.5.2
microsoft-r-open-mro-3.5.2
Microsoft .NET Core 2.0, used for operationalization, must be added to Ubuntu:
dotnet-host-2.0.0
dotnet-hostfxr-2.0.0
dotnet-runtime-2.0.0
Additional open-source packages could be required. The potential list of packages varies for each computer. Refer to offline installation for an example list.
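For a partial installation, the annotations in the package list can be checked mechanically before anything is installed. The sketch below encodes only two rules from that list: the Hadoop package is required, and a pre-trained models package requires microsoftml. It assumes that the R and Python model packages each depend on the microsoftml package for the same language; the function names are hypothetical.

```shell
VERSION="9.4.7"

# Succeeds when the named package is part of the current selection.
selected() {
  case " $SELECTION " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}

# Checks a list of package names against the annotated rules.
# Prints one line per problem and returns non-zero if any rule is broken.
check_selection() {
  SELECTION="$*"
  local status=0 lang
  selected "microsoft-mlserver-hadoop-$VERSION" || {
    echo "missing: microsoft-mlserver-hadoop-$VERSION (required for Hadoop)"
    status=1
  }
  for lang in r py; do
    # Assumption: mlm-<lang> depends on mml-<lang> for the same language.
    if selected "microsoft-mlserver-mlm-$lang-$VERSION" &&
       ! selected "microsoft-mlserver-mml-$lang-$VERSION"; then
      echo "microsoft-mlserver-mlm-$lang-$VERSION requires microsoft-mlserver-mml-$lang-$VERSION"
      status=1
    fi
  done
  return "$status"
}
```

Passing the intended package names as arguments, for example `check_selection microsoft-mlserver-hadoop-9.4.7 microsoft-mlserver-packages-r-9.4.7`, reports any broken rule before the packages are downloaded.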
For a list of functions that use YARN and Hadoop infrastructure to process in parallel across the cluster, see Running a distributed analysis using RevoScaleR functions.
R solutions that execute on the cluster can call functions from any R package. To add new R packages, you can use any of these approaches:
- Use the RevoScaleR rxExec function to add new packages.
- Manually run install.packages() on all nodes in the Hadoop cluster (using distributed shell or some other mechanism).
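For the manual approach, the same `install.packages()` call has to reach every node. The sketch below only builds that R expression from a list of package names so it can be handed to whatever distributed shell you use. The function name, the `CRAN_MIRROR` variable, and the default mirror URL are assumptions, as is the availability of `Rscript` on each node.

```shell
# Builds an install.packages() call for the given R package names.
# CRAN_MIRROR is a hypothetical override; the default mirror is an assumption.
r_install_expr() {
  local quoted="" pkg
  for pkg in "$@"; do
    quoted="${quoted:+$quoted, }\"$pkg\""   # comma-separated, double-quoted names
  done
  printf 'install.packages(c(%s), repos = "%s")\n' \
    "$quoted" "${CRAN_MIRROR:-https://cloud.r-project.org}"
}
```

On each node, the expression could then be run with something like `Rscript -e "$(r_install_expr data.table)"`, executed as a user with write access to the R library.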