Install and use Solr on HDInsight Hadoop clusters

In this topic, you will learn how to install Solr on Azure HDInsight by using Script Action. Solr is a powerful search platform that provides enterprise-level search capabilities on data managed by Hadoop. Once you have installed Solr on the HDInsight cluster, you'll also learn how to search data by using Solr.

Important

The steps in this document require an HDInsight cluster that uses Linux. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight Deprecation on Windows.

The sample script used in this topic creates a Solr cluster with a specific configuration. If you want to configure the Solr cluster with different collections, shards, schemas, replicas, etc., you must modify the script and Solr binaries accordingly.

What is Solr?

Apache Solr is an enterprise search platform that enables powerful full-text search on data. While Hadoop enables storing and managing vast amounts of data, Apache Solr provides the search capabilities to quickly retrieve the data. This topic provides instructions on how to customize an HDInsight cluster to install Solr.

Warning

Components provided with the HDInsight cluster are fully supported and Microsoft Support will help to isolate and resolve issues related to these components.

Custom components, such as Solr, receive commercially reasonable support to help you further troubleshoot the issue. This might resolve the issue, or you might be asked to engage the available channels for the open source technology, where deep expertise for that technology is found. For example, there are many community sites that can be used, such as the MSDN forum for HDInsight and http://stackoverflow.com. Apache projects also have project sites on http://apache.org, for example Hadoop.

What the script does

This script makes the following changes to the HDInsight cluster:

  • Installs Solr into /usr/hdp/current/solr
  • Creates a new user, solrusr, which is used to run the Solr service
  • Sets solrusr as the owner of /usr/hdp/current/solr
  • Adds an Upstart configuration that will start Solr if a cluster node restarts. Solr is also automatically started on the cluster nodes after installation
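
One way to spot-check the result on a cluster node is to confirm that the install directory exists and is owned by the service account. The following is a minimal sketch, not part of the installer script. The function name check_solr_install is hypothetical, and the user name in the example is an assumption based on the list above; substitute the account that actually exists on your nodes. It relies on GNU stat, which is available on Linux.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a directory exists and is owned by the expected user.
# check_solr_install is a hypothetical helper; both arguments come from the caller.
check_solr_install() {
  local dir="$1" expected_owner="$2" owner
  if [ ! -d "$dir" ]; then
    echo "missing: $dir" >&2
    return 1
  fi
  # -L follows a symbolic link, -c '%U' prints the owner's user name (GNU stat)
  owner="$(stat -L -c '%U' "$dir")" || return 1
  if [ "$owner" != "$expected_owner" ]; then
    echo "owner of $dir is $owner, expected $expected_owner" >&2
    return 1
  fi
  echo "ok: $dir is owned by $owner"
}

# Example (run on a cluster node; the user name is an assumption):
#   check_solr_install /usr/hdp/current/solr solrusr
```

The function returns a non-zero status when the directory is missing or the owner differs, so it can be used in a larger health-check script.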

Install Solr using Script Actions

A sample script to install Solr on an HDInsight cluster is available at the following location.

https://hdiconfigactions.blob.core.windows.net/linuxsolrconfigactionv01/solr-installer-v01.sh

This section provides instructions on how to use the sample script when creating a new cluster by using the Azure portal.

Note

Azure PowerShell, the Azure CLI, the HDInsight .NET SDK, or Azure Resource Manager templates can also be used to apply script actions. You can also apply script actions to already running clusters. For more information, see Customize HDInsight clusters with Script Actions.

  1. Start provisioning a cluster by using the steps in Provision Linux-based HDInsight clusters, but do not complete provisioning.
  2. On the Optional Configuration blade, select Script Actions, and provide the information below:

  3. At the bottom of the Script Actions, use the Select button to save the configuration. Finally, use the Select button at the bottom of the Optional Configuration blade to save the optional configuration information.
  4. Continue provisioning the cluster as described in Provision Linux-based HDInsight clusters.

How do I use Solr in HDInsight?

Indexing data

You must first index some data files in Solr. You can then use Solr to run search queries on the indexed data. Use the following steps to add some example data to Solr, and then query it:

  1. Connect to the HDInsight cluster using SSH:

     ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net
    

    For more information on using SSH with HDInsight, see the following:

  2. Use the following commands to have Solr index sample data:

     cd /usr/hdp/current/solr/example/exampledocs
     java -jar post.jar solr.xml monitor.xml
    

    You'll see the following output on the console:

     POSTing file solr.xml
     POSTing file monitor.xml
     2 files indexed.
     COMMITting Solr index changes to http://localhost:8983/solr/update..
     Time spent: 0:00:01.624
    

    The post.jar utility indexes two sample documents, solr.xml and monitor.xml, in Solr. These are stored in collection1 within Solr.

  3. Use the following to query the REST API exposed by Solr:

     curl "http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true"
    

    This issues a query against collection1 for any documents matching *:* (encoded as *%3A* in the query string) and requests that the response be returned as JSON. The response should appear similar to the following:

         "response": {
             "numFound": 2,
             "start": 0,
             "maxScore": 1,
             "docs": [
               {
                 "id": "SOLR1000",
                 "name": "Solr, the Enterprise Search Server",
                 "manu": "Apache Software Foundation",
                 "cat": [
                   "software",
                   "search"
                 ],
                 "features": [
                   "Advanced Full-Text Search Capabilities using Lucene",
                   "Optimized for High Volume Web Traffic",
                   "Standards Based Open Interfaces - XML and HTTP",
                   "Comprehensive HTML Administration Interfaces",
                   "Scalability - Efficient Replication to other Solr Search Servers",
                   "Flexible and Adaptable with XML configuration and Schema",
                   "Good unicode support: héllo (hello with an accent over the e)"
                 ],
                 "price": 0,
                 "price_c": "0,USD",
                 "popularity": 10,
                 "inStock": true,
                 "incubationdate_dt": "2006-01-17T00:00:00Z",
                 "_version_": 1486960636996878300
               },
               {
                 "id": "3007WFP",
                 "name": "Dell Widescreen UltraSharp 3007WFP",
                 "manu": "Dell, Inc.",
                 "manu_id_s": "dell",
                 "cat": [
                   "electronics and computer1"
                 ],
                 "features": [
                   "30\" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast"
                 ],
                 "includes": "USB cable",
                 "weight": 401.6,
                 "price": 2199,
                 "price_c": "2199,USD",
                 "popularity": 6,
                 "inStock": true,
                 "store": "43.17614,-90.57341",
                 "_version_": 1486960637584081000
               }
             ]
           }
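
The query string in the curl command above has to be percent-encoded by hand. As an illustration, the following sketch builds the same select URL from a plain query such as *:*. The helper names urlencode and solr_select_url are hypothetical, and the encoder is intended for ASCII input only.

```shell
#!/usr/bin/env bash
# Percent-encode a string for use as a query-string value.
# Sketch for ASCII input only; "*" is left as-is to match the URL used above.
urlencode() {
  local rest="$1" out="" c
  while [ -n "$rest" ]; do
    c="${rest%"${rest#?}"}"   # first character of what is left
    rest="${rest#?}"          # drop that character
    case "$c" in
      [a-zA-Z0-9.~_*-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;   # "'c" yields the character code
    esac
  done
  printf '%s\n' "$out"
}

# Build a select URL for a host, a collection, and an unencoded query.
# Port 8983 and the wt/indent parameters are taken from the curl example above.
solr_select_url() {
  local host="$1" collection="$2" query="$3"
  printf 'http://%s:8983/solr/%s/select?q=%s&wt=json&indent=true\n' \
    "$host" "$collection" "$(urlencode "$query")"
}

# Example (run on a cluster node):
#   curl "$(solr_select_url localhost collection1 '*:*')"
#   curl "$(solr_select_url localhost collection1 'manu_id_s:dell')"
```

The second example searches a single field that appears in the sample documents instead of matching everything.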
    

Using the Solr dashboard

The Solr dashboard is a web UI that allows you to work with Solr through your web browser. The Solr dashboard is not exposed directly on the Internet from your HDInsight cluster, but must be accessed using an SSH tunnel. For more information on using an SSH tunnel, see Use SSH Tunneling to access Ambari web UI, ResourceManager, JobHistory, NameNode, Oozie, and other web UI's.

Once you have established an SSH tunnel, use the following steps to use the Solr dashboard:

  1. Determine the host name for the primary headnode:

    1. Use SSH to connect to the cluster on port 22. For example, ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net where USERNAME is your SSH user name and CLUSTERNAME is the name of your cluster.

      For more information on using SSH, see the following documents:

    2. Use the following command to get the fully qualified hostname:

       hostname -f
      

      This will return a name similar to the following:

       hn0-myhdi-nfebtpfdv1nubcidphpap2eq2b.ex.internal.cloudapp.net
      

      This is the hostname that should be used in the following steps.

  2. In your browser, connect to http://HOSTNAME:8983/solr/#/, where HOSTNAME is the name you determined in the previous steps.

    The request should be routed through the SSH tunnel to the head node for your HDInsight cluster. You should see a page similar to the following:

    Image of Solr dashboard

  3. From the left pane, use the Core Selector drop-down to select collection1. Several entries should then appear below collection1.
  4. From the entries below collection1, select Query. Use the following values to populate the search page:

    • In the q text box, enter *:*. This will return all the documents that are indexed in Solr. If you want to search for a specific string within the documents, you can enter that string here.
    • In the wt text box, select the output format. Default is json.

      Finally, select the Execute Query button at the bottom of the search page.


      The output returns the two documents that were indexed in Solr earlier. The output resembles the following:

        "response": {
            "numFound": 2,
            "start": 0,
            "maxScore": 1,
            "docs": [
              {
                "id": "SOLR1000",
                "name": "Solr, the Enterprise Search Server",
                "manu": "Apache Software Foundation",
                "cat": [
                  "software",
                  "search"
                ],
                "features": [
                  "Advanced Full-Text Search Capabilities using Lucene",
                  "Optimized for High Volume Web Traffic",
                  "Standards Based Open Interfaces - XML and HTTP",
                  "Comprehensive HTML Administration Interfaces",
                  "Scalability - Efficient Replication to other Solr Search Servers",
                  "Flexible and Adaptable with XML configuration and Schema",
                  "Good unicode support: héllo (hello with an accent over the e)"
                ],
                "price": 0,
                "price_c": "0,USD",
                "popularity": 10,
                "inStock": true,
                "incubationdate_dt": "2006-01-17T00:00:00Z",
                "_version_": 1486960636996878300
              },
              {
                "id": "3007WFP",
                "name": "Dell Widescreen UltraSharp 3007WFP",
                "manu": "Dell, Inc.",
                "manu_id_s": "dell",
                "cat": [
                  "electronics and computer1"
                ],
                "features": [
                  "30\" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast"
                ],
                "includes": "USB cable",
                "weight": 401.6,
                "price": 2199,
                "price_c": "2199,USD",
                "popularity": 6,
                "inStock": true,
                "store": "43.17614,-90.57341",
                "_version_": 1486960637584081000
              }
            ]
          }
      

Starting and stopping Solr

If you need to manually stop or start Solr, use the following commands:

sudo stop solr

sudo start solr
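
After a restart, Solr may need a moment before it answers requests. The following is a minimal sketch of one way to wait for it, by retrying a probe command until it succeeds. The function name retry_until is hypothetical, and the probe URL in the example assumes the collection1 collection used earlier in this document.

```shell
#!/usr/bin/env bash
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry_until is a hypothetical helper, not something installed with Solr.
retry_until() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "succeeded on attempt $n"
      return 0
    fi
    # Wait before the next attempt, but not after the last one.
    [ "$n" -lt "$attempts" ] && sleep "$delay"
    n=$((n + 1))
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# Example (run on a cluster node): restart Solr, then wait for it to answer.
#   sudo stop solr
#   sudo start solr
#   retry_until 30 2 curl -sf "http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json"
```

Passing the probe as an ordinary command keeps the helper independent of Solr, so the same function can wrap any other check.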

Backup indexed data

As a good practice, you should back up the indexed data from the Solr cluster nodes onto Azure Blob storage. Perform the following steps to do so:

  1. Connect to the cluster using SSH, then use the following command to get the host name for the head node:

     hostname -f
    
  2. Use the following to create a snapshot of the indexed data. Replace HOSTNAME with the name returned from the previous command:

     curl "http://HOSTNAME:8983/solr/replication?command=backup"
    

    You should see a response like this:

     <?xml version="1.0" encoding="UTF-8"?>
     <response>
       <lst name="responseHeader">
         <int name="status">0</int>
         <int name="QTime">9</int>
       </lst>
       <str name="status">OK</str>
     </response>
    
  3. Next, change directories to /usr/hdp/current/solr/example/solr. There will be a subdirectory here for each collection. Each collection directory contains a data directory, which is where the snapshot for that collection is located.

    For example, if you used the steps earlier to index the sample documents, the /usr/hdp/current/solr/example/solr/collection1/data directory should now contain a directory named snapshot.########### where the #'s are the date and time of the snapshot.

  4. Create a compressed archive of the snapshot folder using a command similar to the following:

     tar -zcf snapshot.20150806185338855.tgz snapshot.20150806185338855
    

    This will create a new archive named snapshot.20150806185338855.tgz, which contains the contents of the snapshot.20150806185338855 directory.

  5. You can then store the archive to the cluster's primary storage using the following command:

    hadoop fs -copyFromLocal snapshot.20150806185338855.tgz /example/data

    Note

    You may want to create a dedicated directory for storing Solr snapshots. For example, hadoop fs -mkdir /solrbackup.
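
Steps 3 through 5 can be combined into a small script. The following is a minimal sketch under stated assumptions: the helper names latest_snapshot and archive_snapshot are hypothetical, the snapshot directories are assumed to follow the snapshot.<timestamp> naming shown above, and /solrbackup is the example directory from the preceding note.

```shell
#!/usr/bin/env bash
# Print the name of the newest snapshot.* directory inside a data directory.
# Snapshot names end in a fixed-width timestamp, so the last name in sorted
# glob order is the newest. latest_snapshot is a hypothetical helper.
latest_snapshot() {
  local data_dir="$1" d newest=""
  for d in "$data_dir"/snapshot.*/; do
    [ -d "$d" ] || continue      # skip the unexpanded pattern when nothing matches
    d="${d%/}"
    newest="${d##*/}"
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}

# Archive the newest snapshot from DATA_DIR into OUT_DIR and print the archive path.
archive_snapshot() {
  local data_dir="$1" out_dir="$2" name
  name="$(latest_snapshot "$data_dir")" || {
    echo "no snapshot found in $data_dir" >&2
    return 1
  }
  # -C makes the archive contain just the snapshot directory, not its full path
  tar -zcf "$out_dir/$name.tgz" -C "$data_dir" "$name" || return 1
  printf '%s\n' "$out_dir/$name.tgz"
}

# Example (run on a cluster node, after creating a snapshot):
#   archive="$(archive_snapshot /usr/hdp/current/solr/example/solr/collection1/data "$HOME")"
#   hadoop fs -mkdir -p /solrbackup
#   hadoop fs -copyFromLocal "$archive" /solrbackup
```

Writing the archive to a separate output directory avoids needing write access to the Solr data directory itself.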

For more information on working with Solr backup and restores, see Making and restoring backups of SolrCores.

See also

  • Install and use Hue on HDInsight clusters. Hue is a web UI that makes it easy to create, run and save Pig and Hive jobs, as well as browse the default storage for your HDInsight cluster.
  • Install R on HDInsight clusters. Use cluster customization to install R on HDInsight Hadoop clusters. R is an open-source language and environment for statistical computing. It provides hundreds of built-in statistical functions and its own programming language that combines aspects of functional and object-oriented programming. It also provides extensive graphical capabilities.
  • Install Giraph on HDInsight clusters. Use cluster customization to install Giraph on HDInsight Hadoop clusters. Giraph allows you to perform graph processing by using Hadoop, and can be used with Azure HDInsight.
  • Install Hue on HDInsight clusters. Use cluster customization to install Hue on HDInsight Hadoop clusters. Hue is a set of Web applications used to interact with a Hadoop cluster.