Install Machine Learning Server for Hadoop

On a Hadoop cluster, Machine Learning Server must be installed on the edge node and on all data nodes of a commercial Hadoop distribution: Cloudera, Hortonworks, or MapR. Operationalization features are optional and can be installed on edge nodes only.

Machine Learning Server is engineered for the following architecture:

  • Hadoop Distributed File System (HDFS)
  • Apache YARN
  • MapReduce or Spark 2.0-2.1

System and setup requirements

Package managers

Installation is through package managers. Unlike previous releases, there is no install.sh script.

 Package manager    Platform
 yum                RHEL, CentOS
 apt                Ubuntu online
 dpkg               Ubuntu offline
 zypper             SUSE
 rpm                RHEL, CentOS, SUSE
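
As a rough illustration of how the table maps platforms to tools, the following shell sketch picks a package manager from a distribution ID. The function name pkg_tool_for and the ID strings it matches are assumptions for this example (they mirror common values of the ID field in /etc/os-release) and are not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: map a distribution ID to the online package manager
# listed in the table above. The ID strings are assumed values, not an
# exhaustive list.
pkg_tool_for() {
    case "$1" in
        rhel|centos)         echo "yum" ;;
        ubuntu)              echo "apt" ;;
        sles|suse|opensuse*) echo "zypper" ;;
        *)                   echo "unknown"; return 1 ;;
    esac
}

# Example call:
#   pkg_tool_for centos
```

On a real node you could pass in the ID value read from /etc/os-release. For offline installs of local package files, the table lists dpkg for Ubuntu and rpm for RHEL, CentOS, and SUSE.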

Running setup on existing installations

The installation path for Machine Learning Server is new: /opt/microsoft/mlserver/9.2.1. However, if R Server 9.x is present, Machine Learning Server 9.2.1 finds R Server at the old path (/usr/lib64/microsoft-r/9.1.0) and replaces it with the new version.

There is no support for side-by-side installations of older and newer versions, nor is there support for hybrid versions (such as R Server 9.1 and Python 9.2.1). An installation is either entirely 9.2.1 or an earlier version.
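
Before running setup, it can help to know whether an older R Server is present, because setup will replace it. The sketch below looks for version directories under /usr/lib64/microsoft-r. The function name find_legacy_rserver and the optional root-prefix argument are hypothetical conveniences, and the assumption that every older release lives under that directory is based only on the 9.1.0 path mentioned above.

```shell
#!/bin/sh
# Hypothetical pre-install check: list older R Server directories, if any.
# $1 is an optional root prefix, so the check can be tried against a scratch
# directory instead of the live file system.
find_legacy_rserver() {
    root="${1:-}"
    for dir in "$root"/usr/lib64/microsoft-r/*; do
        # When the glob matches nothing, $dir is the literal pattern and the
        # directory test fails, so nothing is printed.
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}

# Example call against the live file system:
#   find_legacy_rserver
```

Empty output suggests there is no earlier installation at the old path. Any path printed is one that setup would be expected to replace.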

Installation paths

After installation completes, software can be found at the following paths:

  • Install root: /opt/microsoft/mlserver/9.2.1
  • Microsoft R Open root: /opt/microsoft/ropen/3.4.1
  • Executables such as Revo64 and mlserver-python are at /usr/bin
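
A quick way to confirm these locations after setup is to test each path in turn. The following is a minimal sketch. The function name check_mls_paths and its optional root-prefix argument are hypothetical, and the paths are copied from the list above.

```shell
#!/bin/sh
# Hypothetical post-install check for the paths listed above.
# $1 is an optional root prefix (useful for trying the check in a scratch
# directory). Returns 0 when every path exists, 1 otherwise.
check_mls_paths() {
    root="${1:-}"
    missing=0
    for p in \
        /opt/microsoft/mlserver/9.2.1 \
        /opt/microsoft/ropen/3.4.1 \
        /usr/bin/Revo64 \
        /usr/bin/mlserver-python
    do
        if [ -e "$root$p" ]; then
            echo "found:   $p"
        else
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Example call against the live file system:
#   check_mls_paths
```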

1 - Edge node installation

Start here. Machine Learning Server is required on the edge node. You should run full setup, following the installation commands for the Linux operating system used by your cluster: Linux install > How to install.

Full setup gives you core components for both R and Python, machine learning algorithms and pretrained models, and operationalization. Operationalization features run on edge nodes and provide additional ways of deploying and consuming scripts. For example, you can build and deploy web services, which let you invoke and access your solution programmatically through a REST API.

Note

You cannot use operationalization on data nodes. Operationalization does not support YARN queues and cannot run in a distributed manner.

2 - Data node installation

You can continue installation by running Setup on any data node, either sequentially or on multiple data nodes concurrently. There are two approaches for installing Machine Learning Server on data nodes.

Approach 1: Package managers for full installation

Again, we recommend running the full setup on every node. This approach is fast because package managers do most of the work, including adding the Hadoop package (microsoft-mlserver-hadoop-9.2.1) and setting it up for activation.

As before, follow the installation steps for the Linux operating system used by your cluster: Linux install > How to install.

Approach 2: Manual steps for partial installation

Alternatively, you can install a subset of packages. You might do this if you do not want operationalization on your data nodes, or if you want to exclude a specific language. Be prepared for more testing if you choose this approach. The packages are not specifically designed to run as standalone modules. Hence, unexpected problems are more likely if you leave some packages out.

  1. Install as root: sudo su

  2. Refer to the annotated package list and download individual packages from the package repo corresponding to your platform.

  3. Make a directory to contain your packages: hadoop fs -mkdir /tmp/mlsdatanode

  4. Copy the packages: hadoop fs -copyFromLocal /tmp/mlserver /tmp/mlsdatanode

  5. Switch to the directory: cd /tmp/mlsdatanode

  6. Install the packages using the tool and syntax for your platform:

    • On Ubuntu online: apt-get install ./*.deb
    • On Ubuntu offline: dpkg -i *.deb
    • On CentOS and RHEL: yum install *.rpm

  7. Activate the server: /opt/microsoft/mlserver/9.2.1/bin/R/activate.sh

Repeat this procedure on remaining nodes.
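
When there are many data nodes, it can be useful to print the per-node commands first and review them before running anything. The sketch below is a dry run only. The function name plan_datanode_install and the node names are hypothetical, and it assumes SSH access with sudo rights, a CentOS or RHEL node (hence yum), and packages already present in the given directory on each node.

```shell
#!/bin/sh
# Hypothetical dry run: print one install command line per data node instead
# of executing anything. $1 is the package directory on each node; the
# remaining arguments are node names (placeholders).
plan_datanode_install() {
    pkg_dir="$1"
    shift
    for node in "$@"; do
        # The *.rpm pattern sits inside double quotes, so it is printed
        # literally here and only expanded later on the remote node.
        echo "ssh $node 'cd $pkg_dir && sudo yum install *.rpm && sudo /opt/microsoft/mlserver/9.2.1/bin/R/activate.sh'"
    done
}

# Example call:
#   plan_datanode_install /tmp/mlsdatanode datanode1 datanode2
```

Printing the plan rather than executing it keeps the sketch safe to try, and the output can be adapted for Ubuntu by swapping in the dpkg or apt-get command from step 6.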

Packages list

The following packages comprise a full Machine Learning Server installation:

 microsoft-mlserver-packages-r-9.2.1        ** core
 microsoft-mlserver-python-9.2.1            ** core
 microsoft-mlserver-packages-py-9.2.1       ** core
 microsoft-mlserver-hadoop-9.2.1            ** hadoop (required for hadoop)
 microsoft-mlserver-mml-r-9.2.1             ** microsoftml for R (optional)
 microsoft-mlserver-mml-py-9.2.1            ** microsoftml for Python (optional)
 microsoft-mlserver-mlm-r-9.2.1             ** pre-trained models (requires mml)
 microsoft-mlserver-mlm-py-9.2.1            ** pre-trained models (requires mml)
 microsoft-mlserver-adminutil-9.2           ** operationalization (optional)
 microsoft-mlserver-computenode-9.2         ** operationalization (optional)
 microsoft-mlserver-config-rserve-9.2       ** operationalization (optional) 
 microsoft-mlserver-dotnet-9.2              ** operationalization (optional)
 microsoft-mlserver-webnode-9.2             ** operationalization (optional)
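
If you take the partial-installation approach on data nodes, one way to keep track of the subset is to hold the package names in a small script and compare them against what is installed. The sketch below is an illustration under assumptions: the function names datanode_packages and missing_packages are hypothetical, and the subset shown (everything except the operationalization packages) is only one possible choice.

```shell
#!/bin/sh
# Hypothetical helper: print the package names for a data node that skips
# operationalization. Names are copied from the list above.
datanode_packages() {
    cat <<'EOF'
microsoft-mlserver-packages-r-9.2.1
microsoft-mlserver-python-9.2.1
microsoft-mlserver-packages-py-9.2.1
microsoft-mlserver-hadoop-9.2.1
microsoft-mlserver-mml-r-9.2.1
microsoft-mlserver-mml-py-9.2.1
microsoft-mlserver-mlm-r-9.2.1
microsoft-mlserver-mlm-py-9.2.1
EOF
}

# Hypothetical helper: read installed package names from stdin (one per line)
# and print each expected package that is not among them.
missing_packages() {
    installed=$(cat)
    datanode_packages | while read -r pkg; do
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}

# Example call on an RPM-based node:
#   rpm -qa --qf '%{NAME}\n' | missing_packages
```

The example call assumes the installed package names match the list exactly. On Ubuntu, dpkg-query -W -f='${Package}\n' could supply the input instead.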

The microsoft-mlserver-python-9.2.1 package provides Anaconda 4.2 with Python 3.5, which runs as mlserver-python and is found at /opt/microsoft/mlserver/9.2.1/bin/python/python.

Microsoft R Open is required for R execution:

 microsoft-r-open-foreachiterators-3.4.1 
 microsoft-r-open-mkl-3.4.1
 microsoft-r-open-mro-3.4.1 

Microsoft .NET Core 1.1, used for operationalization, must be added to Ubuntu:

 dotnet-host
 dotnet-hostfxr-1.1.0
 dotnet-sharedframework-microsoft.netcore.app-1.1.2 

Additional open source packages could be required. The potential list of packages varies for each computer. Refer to offline installation for an example list.

Next steps

We recommend starting with How to use RevoScaleR with Spark or How to use RevoScaleR with Hadoop MapReduce.

For a list of functions that use YARN and Hadoop infrastructure to process in parallel across the cluster, see Running a distributed analysis using RevoScaleR functions.
