Use SSH with HDInsight (Hadoop) from Windows, Linux, Unix, or OS X

Secure Shell (SSH) allows you to log in to a Linux-based HDInsight cluster and run commands using a command line interface. This document provides basic information about SSH and specific information about using SSH with HDInsight.

What is SSH?

SSH is a cryptographic network protocol that allows you to securely communicate with a remote server over an unsecured network. SSH is used to provide a secure command-line login to a remote server; in this case, the head nodes or edge node of an HDInsight cluster.

You can also use SSH to tunnel network traffic from your client to the HDInsight cluster. Using a tunnel allows you to access services on the HDInsight cluster that are not exposed directly to the internet. For more information on using SSH tunneling with HDInsight, see Use SSH tunneling with HDInsight.

SSH clients

Many operating systems provide SSH client functionality through the ssh and scp command line utilities.

  • ssh: A general SSH client that can be used to establish a remote command line session and create tunnels.
  • scp: A utility that copies files between local and remote systems using the SSH protocol.
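To find out whether these utilities are already installed on your system, you can look them up on the PATH. The following is a minimal sketch; check_tool is a hypothetical helper name, not part of any SSH client.

```shell
# Report whether a command-line tool can be found on the PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: available"
    else
        echo "$1: not found"
    fi
}

check_tool ssh
check_tool scp
```

If either tool is reported as not found, install an SSH client for your operating system before continuing.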

Windows did not provide an SSH client until the Windows 10 Anniversary Edition. That version of Windows includes the Bash on Windows 10 feature for developers, which provides ssh, scp, and other Linux commands. For more information on using Bash on Windows 10, see Bash on Ubuntu on Windows.

If you use Windows and do not have access to Bash on Windows 10, we recommend the following SSH clients:

  • Git For Windows: Provides the ssh and scp command line utilities.
  • PuTTY: Provides a graphical SSH client.
  • MobaXterm: Provides a graphical SSH client.
  • Cygwin: Provides the ssh and scp command line utilities.
Note

The steps in this document assume that you have access to the ssh command. If you are using a client such as PuTTY or MobaXterm, consult the documentation for that product for the equivalent command and parameters.

SSH Authentication

An SSH connection can be authenticated using either a password or public-key cryptography (https://en.wikipedia.org/wiki/Public-key_cryptography). Using a key is the more secure option, as it is not vulnerable to many of the attacks that affect passwords. However, creating and managing keys is more complicated than using a password.

Using public-key cryptography involves creating a public and private key pair.

  • The public key is loaded into the nodes of your HDInsight cluster, or any other service that you wish to use with public-key cryptography.

  • The private key is what you present to the HDInsight cluster when you log in using an SSH client, to verify your identity. Protect this private key. Do not share it.

    You can add additional security by creating a passphrase for the private key. You must provide this passphrase before the key can be used.

Create a public and private key

The ssh-keygen utility is the easiest way to create a public and private key pair for use with HDInsight.

Note

If you are using a GUI SSH client such as MobaXterm or PuTTY, consult the documentation for your client on how to generate keys.

From a command line, use the following command to create a new key pair:

ssh-keygen -t rsa -b 2048

You are prompted for the following information:

  • The file location: The location defaults to ~/.ssh/id_rsa.

  • An optional passphrase: If you enter a passphrase, you must enter it again each time you authenticate to your HDInsight cluster.

Important

The passphrase is a password for the private key. Any time you use the private key to authenticate, you must provide the passphrase before the key can be used. If someone gets your private key, they will be unable to use it without the passphrase.

If you forget the passphrase, there is no way to reset or recover it.
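If you want to experiment with key generation without touching your real keys, the following sketch writes a throwaway key pair to a temporary directory. The file name hdi_demo_key, the comment text, and the hard-coded passphrase are assumptions made for this demo only. For a real key, omit -N and type the passphrase at the prompt.

```shell
# Demo only: write a throwaway key pair to a temporary directory so that
# nothing in ~/.ssh is overwritten.
keydir="$(mktemp -d)"
keyfile="$keydir/hdi_demo_key"   # hypothetical file name

# -f sets the output file, -N the passphrase, -C a comment, -q suppresses output.
# Assumption: a hard-coded passphrase is acceptable for a throwaway key.
ssh-keygen -q -t rsa -b 2048 -f "$keyfile" -N "demo-passphrase" -C "hdinsight-demo"

# The directory now holds the private key and the matching .pub file.
ls "$keydir"

# Print the fingerprint of the public key.
ssh-keygen -l -f "$keyfile.pub"
```

Delete the temporary directory when you are done with the demo key.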

After the command finishes, you will have two new files:

  • id_rsa: This file contains the private key.

    Warning

    You must restrict access to this file to prevent unauthorized access to services secured by the public key.

  • id_rsa.pub: This file contains the public key. You use this file when creating an HDInsight cluster.

    Note

    It doesn't matter who has access to the public key. By itself, all the public key can do is verify the private key. Services such as the SSH server use the public key to verify your identity when you authenticate using the private key.
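One common way to restrict access to the private key on Linux, Unix, or OS X is to make the file readable and writable only by its owner. The following is a sketch under that assumption; protect_key is a hypothetical helper name, and the path in the usage comment assumes the default location.

```shell
# Restrict the private key to its owner and leave the public key readable.
# protect_key is a hypothetical helper name used only for this sketch.
# usage: protect_key /path/to/private_key
protect_key() {
    chmod 600 "$1" &&       # private key: owner read/write only
    chmod 644 "$1.pub"      # public key: readable by everyone
}

# Example, assuming the default key location:
#   protect_key "$HOME/.ssh/id_rsa"
```

SSH clients such as OpenSSH typically refuse to use a private key file that other users can read, so this step also avoids connection errors.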

Configure SSH on HDInsight

When you create a Linux-based HDInsight cluster, you must provide an SSH username and either a password or public key. During cluster creation, this information is used to create a login on the HDInsight cluster nodes. The password or public key is used to secure the user account.

For more information on configuring SSH during cluster creation, see one of the following documents:

Additional SSH users

Additional SSH users can be added to the cluster after it has been created, but doing so is not recommended for the following reasons:

  • You must manually add new SSH users to each node in the cluster.

  • New SSH users have the same access to HDInsight as the default user. There is no way to restrict access to data or jobs in HDInsight based on SSH user account.

To restrict access on a per-user basis, you must use a domain-joined HDInsight cluster. Domain-joined HDInsight uses Active Directory to control access to cluster resources.

With a domain-joined HDInsight cluster, multiple users can connect using SSH and then authenticate with their own Active Directory accounts. See the Domain joined HDInsight section for more information.

Connect to HDInsight

While all the nodes in an HDInsight cluster run the SSH server, you can only connect to the head nodes or edge nodes over the public internet.

  • To connect to the head nodes, use CLUSTERNAME-ssh.azurehdinsight.net, where CLUSTERNAME is the name of the HDInsight cluster. Connecting on port 22 (the default for SSH) connects to the primary head node. Port 23 connects to the secondary head node.

  • To connect to an edge node, use EDGENAME.CLUSTERNAME-ssh.azurehdinsight.net, where EDGENAME is the name of the edge node and CLUSTERNAME is the name of the HDInsight cluster. Use port 22 when connecting to the edge node.

The following examples demonstrate how to connect to the head nodes and edge node of a cluster named myhdi using an SSH username of sshuser. The edge node is named myedge.

To do this...                         | Use this...
Connect to the primary head node      | ssh sshuser@myhdi-ssh.azurehdinsight.net
Connect to the secondary head node    | ssh -p 23 sshuser@myhdi-ssh.azurehdinsight.net
Connect to the edge node              | ssh sshuser@myedge.myhdi-ssh.azurehdinsight.net
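The naming and port rules above can be captured in a small helper that prints the matching command. This is only a sketch; hdi_ssh_cmd and its argument order are hypothetical, and the function prints the command rather than running it.

```shell
# Print the ssh command for a head node or an edge node.
# hdi_ssh_cmd is a hypothetical helper name used only for this sketch.
# usage: hdi_ssh_cmd USER CLUSTERNAME [primary|secondary|EDGENAME]
hdi_ssh_cmd() {
    user="$1"
    cluster="$2"
    target="${3:-primary}"
    host="${cluster}-ssh.azurehdinsight.net"
    case "$target" in
        primary)   echo "ssh ${user}@${host}" ;;            # port 22, the default
        secondary) echo "ssh -p 23 ${user}@${host}" ;;      # port 23
        *)         echo "ssh ${user}@${target}.${host}" ;;  # anything else is an edge node name
    esac
}

hdi_ssh_cmd sshuser myhdi secondary
# prints: ssh -p 23 sshuser@myhdi-ssh.azurehdinsight.net
```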

If you use a password to secure the SSH account, you are prompted to enter the password.

If you use a public key to secure the SSH account, you may need to specify the path to the matching private key by using the -i switch. The following example demonstrates using the -i switch:

ssh -i /path/to/private.key sshuser@myhdi-ssh.azurehdinsight.net
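The scp utility accepts the same -i switch for the private key. The following sketch copies a local file to the home directory of the SSH user on the primary head node. The helper name hdi_upload and its argument order are hypothetical.

```shell
# Copy a local file to the SSH user's home directory on the primary head node.
# hdi_upload is a hypothetical helper name used only for this sketch.
# usage: hdi_upload PRIVATE_KEY LOCAL_FILE USER CLUSTERNAME
hdi_upload() {
    # The trailing colon with no path means "the remote user's home directory".
    scp -i "$1" "$2" "${3}@${4}-ssh.azurehdinsight.net:"
}

# Example (not run here because it needs a live cluster):
#   hdi_upload ~/.ssh/id_rsa ./data.csv sshuser myhdi
```

Note that scp selects a non-default port with an uppercase -P switch, while ssh uses a lowercase -p.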

Connect to other nodes

The worker nodes and Zookeeper nodes are not directly accessible from outside the cluster, but they can be accessed from the cluster head nodes or edge nodes. The following are the general steps to accomplish this:

  1. Use SSH to connect to a head or edge node:

     ssh sshuser@myhdi-ssh.azurehdinsight.net
    
  2. From the SSH connection to the head or edge node, use the ssh command to connect to a worker node in the cluster:

     ssh sshuser@wn0-myhdi
    

    To retrieve a list of the worker nodes in the cluster, see the example of how to retrieve the fully qualified domain name of cluster nodes in the Manage HDInsight by using the Ambari REST API document.
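The two steps above can also be combined into one command by asking ssh to run a second ssh on the head node. This is a sketch, not the only way to reach a worker node. The helper name hdi_hop is hypothetical; -t requests a terminal for the nested session and -A enables agent forwarding for this one connection.

```shell
# Connect to a worker node through the primary head node in one command.
# hdi_hop is a hypothetical helper name used only for this sketch.
# usage: hdi_hop USER CLUSTERNAME NODE
hdi_hop() {
    # -A forwards the local ssh-agent (needed for key authentication on the
    #    second hop); -t allocates a terminal for the nested ssh session.
    ssh -A -t "${1}@${2}-ssh.azurehdinsight.net" ssh "${1}@${3}"
}

# Example (not run here because it needs a live cluster):
#   hdi_hop sshuser myhdi wn0-myhdi
```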

If the SSH account is secured using a password, you are asked to enter the password and the connection is established.

If you use an SSH key to authenticate your user account, you must make sure that your local environment is configured for SSH agent forwarding.

Important

The following steps assume a Linux/UNIX based system, and work with Bash on Windows 10. If these steps do not work for your system, you may need to consult the documentation for your SSH client.

  1. Using a text editor, open ~/.ssh/config. If this file doesn't exist, you can create it by entering touch ~/.ssh/config at a command line.

  2. Add the following to the file. Replace CLUSTERNAME with the name of your HDInsight cluster.

     Host CLUSTERNAME-ssh.azurehdinsight.net
       ForwardAgent yes
    

    This entry configures SSH agent forwarding for your HDInsight cluster.

  3. Test SSH agent forwarding by using the following command from the terminal:

     echo "$SSH_AUTH_SOCK"
    

    This command returns information similar to the following text:

     /tmp/ssh-rfSUL1ldCldQ/agent.1792
    

    If nothing is returned, ssh-agent is not running. See the agent startup scripts information at Using ssh-agent with ssh (http://mah.everybody.org/docs/ssh) or consult your SSH client documentation for specific steps on installing and configuring ssh-agent.

  4. Once you have verified that ssh-agent is running, use the following to add your SSH private key to the agent:

     ssh-add ~/.ssh/id_rsa
    

    If your private key is stored in a different file, replace ~/.ssh/id_rsa with the path to the file.
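The steps above can be sketched as two small shell functions. The names enable_agent_forwarding and load_key are hypothetical, and the optional second argument of enable_agent_forwarding (an alternate config file) exists only so that the sketch can be tried without editing your real ~/.ssh/config.

```shell
# Add a ForwardAgent entry for the cluster unless one is already present.
# usage: enable_agent_forwarding CLUSTERNAME [CONFIG_FILE]
enable_agent_forwarding() {
    host="${1}-ssh.azurehdinsight.net"
    config="${2:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$config")"
    touch "$config"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! grep -qxF "Host ${host}" "$config"; then
        printf 'Host %s\n  ForwardAgent yes\n' "$host" >> "$config"
    fi
}

# Add a private key to ssh-agent, after checking that the agent is running.
# usage: load_key [PRIVATE_KEY]
load_key() {
    key="${1:-$HOME/.ssh/id_rsa}"
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        echo "ssh-agent does not appear to be running" >&2
        return 1
    fi
    ssh-add "$key"
}

# Example:
#   enable_agent_forwarding myhdi
#   load_key
```

Calling enable_agent_forwarding more than once for the same cluster leaves a single entry in the file.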

Domain joined HDInsight

Domain-joined HDInsight integrates Kerberos with Hadoop in HDInsight. Because the SSH user is not an Active Directory domain user, you cannot run Hadoop commands until you authenticate with Active Directory. Use the following steps to authenticate your SSH session with Active Directory:

  1. Connect to a domain-joined HDInsight cluster using SSH, as described in the Connect to HDInsight section. For example, the following command connects to an HDInsight cluster named myhdi using an SSH account named sshuser.

     ssh sshuser@myhdi-ssh.azurehdinsight.net
    
  2. Use the following to authenticate using a domain user and password:

     kinit
    

    When prompted, enter a domain user name and the password for the domain user.

    For more information on how to configure domain users for domain-joined HDInsight clusters, see Configure Domain-joined HDInsight clusters.

After authenticating using the kinit command, you can now use Hadoop commands such as hdfs dfs -ls / or hive.
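The sequence can be sketched as a function that you run on the head node after connecting with SSH. The helper name hdi_domain_login is hypothetical, and passing the domain user to kinit as an argument is an assumption of this sketch; the steps above run kinit without arguments.

```shell
# Authenticate the current SSH session with Active Directory, then run a
# Hadoop command. hdi_domain_login is a hypothetical helper name.
# usage: hdi_domain_login DOMAIN_USER
hdi_domain_login() {
    # Request a Kerberos ticket; kinit prompts for the domain user's password.
    kinit "$1" || return 1
    # Show the ticket that was just obtained.
    klist || return 1
    # A Hadoop command that requires the ticket.
    hdfs dfs -ls /
}

# Example with a placeholder user name (replace with a real domain user):
#   hdi_domain_login user@EXAMPLE.COM
```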

SSH tunneling

SSH can be used to tunnel local requests, such as web requests, to the HDInsight cluster. The request is then routed to the requested resource as if it had originated on the HDInsight cluster head node.

Important

An SSH tunnel is a requirement for accessing the web UI for some Hadoop services. For example, both the Job History UI and the Resource Manager UI can only be accessed using an SSH tunnel.

For more information on creating and using an SSH tunnel, see Use SSH Tunneling to access Ambari web UI, JobHistory, NameNode, Oozie, and other web UI's.
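As a rough sketch of what a tunnel command can look like, the following function opens a SOCKS proxy on a local port and forwards traffic through the primary head node. The helper name hdi_tunnel and the default port 9876 are assumptions made for this example; the linked document describes the recommended setup.

```shell
# Open a SOCKS proxy on a local port that forwards traffic through the
# primary head node. hdi_tunnel is a hypothetical helper name.
# usage: hdi_tunnel USER CLUSTERNAME [LOCAL_PORT]
hdi_tunnel() {
    # -D enables dynamic (SOCKS) port forwarding on the given local port.
    # -N tells ssh not to run a remote command; the session only carries the tunnel.
    ssh -D "${3:-9876}" -N "${1}@${2}-ssh.azurehdinsight.net"
}

# Example (not run here because it needs a live cluster):
#   hdi_tunnel sshuser myhdi
```

While the command is running, a browser configured to use localhost and the chosen port as a SOCKS proxy sends its requests through the tunnel.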

Next steps

Now that you understand how to authenticate by using an SSH key, learn how to use MapReduce with Hadoop on HDInsight.