Azure Cosmos DB: Perform graph analytics by using Spark and Apache TinkerPop Gremlin

Azure Cosmos DB is the globally distributed, multi-model database service from Microsoft. You can create and query document, key/value, and graph databases, all of which benefit from the global-distribution and horizontal-scale capabilities at the core of Azure Cosmos DB. Azure Cosmos DB supports online transaction processing (OLTP) graph workloads that use Apache TinkerPop Gremlin.

Spark is an Apache Software Foundation project that's focused on general-purpose distributed data processing, including online analytical processing (OLAP). Spark provides a hybrid in-memory/disk-based distributed computing model that is similar to the Hadoop MapReduce model. You can deploy Apache Spark in the cloud by using Azure HDInsight.

By combining Azure Cosmos DB and Spark, you can perform both OLTP and OLAP workloads when you use Gremlin. This quick-start article demonstrates how to run Gremlin queries against Azure Cosmos DB on an Azure HDInsight Spark cluster.

Prerequisites

Before you can run this sample, you must have the following prerequisites:

  • An Azure HDInsight Spark 2.0 cluster
  • JDK 1.8 or later (if you don't have the JDK, run apt-get install default-jdk)
  • Maven (if you don't have Maven, run apt-get install maven)
  • An Azure subscription (if you don't have one, create a free account before you begin)

For information about how to set up an Azure HDInsight Spark cluster, see Provisioning HDInsight clusters.
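
If you want to verify the JDK and Maven prerequisites on the cluster's master node, the following optional check works on an Ubuntu-based node (the apt-get package names match the prerequisites above):

    # Print the installed Java and Maven versions; install the packages if they're missing
    java -version || sudo apt-get install -y default-jdk
    mvn --version || sudo apt-get install -y maven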

Create an Azure Cosmos DB database account

First, create a database account with the Graph API by doing the following:

  1. In a new window, sign in to the Azure portal.

  2. In the left pane, select New > Databases > Azure Cosmos DB > Create.

    (Screenshot: Azure portal "Databases" pane)

  3. Under New account, specify the configuration that you want for this Azure Cosmos DB account.

    With Azure Cosmos DB, you can choose one of four programming models: Gremlin (graph), MongoDB, SQL (DocumentDB), and Table (key-value). Each model currently requires a separate account.

    In this quick-start article, we program against the Graph API, so choose Gremlin (graph) as you fill out the form. If you have document data from a catalog app, key/value (table) data, or data that was migrated from a MongoDB app, keep in mind that Azure Cosmos DB can provide a highly available, globally distributed database service platform for all your mission-critical applications.

    Fill in the fields on the New account blade by using the information in the following screenshot as a guide. Your values might be different from the values in the screenshot.

    "New account" blade

    Setting | Suggested value | Description
    ID | Unique value | A unique name that identifies this Azure Cosmos DB account. Because documents.azure.com is appended to the ID that you provide to create your URI, use a unique but identifiable ID. The ID must contain only lowercase letters, numbers, and the hyphen (-) character, and it must be from 3 to 50 characters long.
    API | Gremlin (graph) | We program against the Graph API later in this article.
    Subscription | Your subscription | The Azure subscription that you want to use for this Azure Cosmos DB account.
    Resource group | The same value as ID | The new resource group name for your account. For simplicity, you can use the same name as your ID.
    Location | The region closest to your users | The geographic location in which to host your Azure Cosmos DB account. Choose the location closest to your users to give them the fastest access to the data.
  4. Select Create to create the account.

  5. On the toolbar, select the Notifications icon to monitor the deployment process.

    (Screenshot: Azure portal "Notifications" pane)

  6. When the Notifications window indicates the deployment succeeded, close the window. Open the new account from the All resources tile on the Dashboard.

    "All resources" tile

Add a collection

You can now use the Data Explorer tool in the Azure portal to create a graph database.

  1. In the Azure portal, in the menu on the left, select Data Explorer (Preview).

  2. Under Data Explorer (Preview), select New Graph. Then fill in the page by using the following information:

    (Screenshot: Data Explorer in the Azure portal)

    Setting | Suggested value | Description
    Database id | sample-database | The ID for your new database. Database names must be between 1 and 255 characters and can't contain / \ # ? or a trailing space.
    Graph id | sample-graph | The ID for your new graph. Graph names have the same character requirements as database IDs.
    Storage capacity | 10 GB | Leave the default value. This is the storage capacity of the database.
    Throughput | 400 RUs | Leave the default value. You can scale up the throughput later if you want to reduce latency.
    Partition key | /firstName | A partition key that distributes data evenly across partitions. Selecting the correct partition key is important for creating a performant graph. For more information, see Designing for partitioning.
  3. After the form is filled out, select OK.

Get Apache TinkerPop

Get Apache TinkerPop by doing the following:

  1. Connect to the master node of the HDInsight cluster by using SSH: ssh tinkerpop3-cosmosdb-demo-ssh.azurehdinsight.net.

  2. Clone the TinkerPop3 source code, build it locally, and install it to the local Maven cache:

    git clone https://github.com/apache/tinkerpop.git
    cd tinkerpop
    mvn clean install
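
    The full build also runs TinkerPop's test suite, which can take a long time. If you only need the artifacts, you can usually shorten the build by skipping tests with Maven's standard flag (an optional shortcut, not part of the original steps):

    mvn clean install -DskipTests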
    
  3. Install the Spark-Gremlin plug-in.

    a. Plug-in installation is handled by Grape. Populate the repository information for Grape so that it can download the plug-in and its dependencies.

    If the Grape configuration file isn't present at ~/.groovy/grapeConfig.xml, create it with the following settings:

    <ivysettings>
      <settings defaultResolver="downloadGrapes"/>
      <resolvers>
        <chain name="downloadGrapes">
          <filesystem name="cachedGrapes">
            <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
            <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision].[ext]"/>
          </filesystem>
          <ibiblio name="codehaus" root="http://repository.codehaus.org/" m2compatible="true"/>
          <ibiblio name="central" root="http://central.maven.org/maven2/" m2compatible="true"/>
          <ibiblio name="jitpack" root="https://jitpack.io" m2compatible="true"/>
          <ibiblio name="java.net2" root="http://download.java.net/maven/2/" m2compatible="true"/>
          <ibiblio name="apache-snapshots" root="http://repository.apache.org/snapshots/" m2compatible="true"/>
          <ibiblio name="local" root="file:${user.home}/.m2/repository/" m2compatible="true"/>
        </chain>
      </resolvers>
    </ivysettings>
    

    b. Start the Gremlin console: bin/gremlin.sh.

    c. Install the Spark-Gremlin plug-in with version 3.3.0-SNAPSHOT, which you built in the previous steps:

    $ bin/gremlin.sh
    
            \,,,/
            (o o)
    -----oOOo-(3)-oOOo-----
    plugin activated: tinkerpop.server
    plugin activated: tinkerpop.utilities
    plugin activated: tinkerpop.tinkergraph
    gremlin> :install org.apache.tinkerpop spark-gremlin 3.3.0-SNAPSHOT
    ==>loaded: [org.apache.tinkerpop, spark-gremlin, 3.3.0-SNAPSHOT] - restart the console to use [tinkerpop.spark]
    gremlin> :q
    $ bin/gremlin.sh
    
            \,,,/
            (o o)
    -----oOOo-(3)-oOOo-----
    plugin activated: tinkerpop.server
    plugin activated: tinkerpop.utilities
    plugin activated: tinkerpop.tinkergraph
    gremlin> :plugin use tinkerpop.spark
    ==>tinkerpop.spark activated
    
  4. Check whether the Hadoop-Gremlin plug-in is activated by running :plugin list. If it is, disable it with :plugin unuse tinkerpop.hadoop, because it can interfere with the Spark-Gremlin plug-in.
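
    For reference, a representative console session might look like the following (your plug-in list can differ):

    gremlin> :plugin list
    ==>tinkerpop.server[active]
    ==>tinkerpop.utilities[active]
    ==>tinkerpop.tinkergraph[active]
    ==>tinkerpop.hadoop[active]
    ==>tinkerpop.spark[active]
    gremlin> :plugin unuse tinkerpop.hadoop
    ==>tinkerpop.hadoop deactivated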

Prepare TinkerPop3 dependencies

When you built TinkerPop3 in the previous step, the build also pulled all of the jar dependencies for Spark and Hadoop into the target directory. Use the jars that are pre-installed with HDInsight, and pull in additional dependencies only as necessary.

  1. Go to the Gremlin Console target directory at tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone.

  2. Move all jars under ext/ to lib/: find ext/ -name '*.jar' -exec mv {} lib/ \;.

  3. Remove all jar libraries under lib/ that are not in the following list (a scripted approach is sketched after the list):

    # TinkerPop3
    gremlin-console-3.3.0-SNAPSHOT.jar
    gremlin-core-3.3.0-SNAPSHOT.jar       
    gremlin-groovy-3.3.0-SNAPSHOT.jar     
    gremlin-shaded-3.3.0-SNAPSHOT.jar     
    hadoop-gremlin-3.3.0-SNAPSHOT.jar     
    spark-gremlin-3.3.0-SNAPSHOT.jar      
    tinkergraph-gremlin-3.3.0-SNAPSHOT.jar
    
    # Gremlin dependencies
    asm-3.2.jar                                
    avro-1.7.4.jar                             
    caffeine-2.3.1.jar                         
    cglib-2.2.1-v20090111.jar                  
    gbench-0.4.3-groovy-2.4.jar                
    gprof-0.3.1-groovy-2.4.jar                 
    groovy-2.4.9-indy.jar                      
    groovy-2.4.9.jar                           
    groovy-console-2.4.9.jar                   
    groovy-groovysh-2.4.9-indy.jar             
    groovy-json-2.4.9-indy.jar                 
    groovy-jsr223-2.4.9-indy.jar               
    groovy-sql-2.4.9-indy.jar                  
    groovy-swing-2.4.9.jar                     
    groovy-templates-2.4.9.jar                 
    groovy-xml-2.4.9.jar                       
    hadoop-yarn-server-nodemanager-2.7.2.jar   
    hppc-0.7.1.jar                             
    javatuples-1.2.jar                         
    jaxb-impl-2.2.3-1.jar                      
    jbcrypt-0.4.jar                            
    jcabi-log-0.14.jar                         
    jcabi-manifests-1.1.jar                    
    jersey-core-1.9.jar                        
    jersey-guice-1.9.jar                       
    jersey-json-1.9.jar                        
    jettison-1.1.jar                           
    scalatest_2.11-2.2.6.jar                   
    servlet-api-2.5.jar                        
    snakeyaml-1.15.jar                         
    unused-1.0.0.jar                           
    xml-apis-1.3.04.jar                        
    

Get the Azure Cosmos DB Spark connector

  1. Get the Azure Cosmos DB Spark connector azure-documentdb-spark-0.0.3-SNAPSHOT.jar and the Cosmos DB Java SDK azure-documentdb-1.10.0.jar from the Azure Cosmos DB Spark connector repository on GitHub.

  2. Alternatively, you can build everything locally. Because the latest released version of Spark-Gremlin was built against Spark 1.6.1, and the Azure Cosmos DB Spark connector currently uses Spark 2.0.2, you build the latest TinkerPop3 code and install the jars manually. Do the following:

    a. Clone the Azure Cosmos DB Spark connector.

    b. Build TinkerPop3 (already done in previous steps). Install all TinkerPop 3.3.0-SNAPSHOT jars locally.

    mvn install:install-file -Dfile="gremlin-core-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-core -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="gremlin-groovy-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-groovy -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="gremlin-shaded-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-shaded -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="hadoop-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=hadoop-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="spark-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=spark-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="tinkergraph-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=tinkergraph-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
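
    Because these commands differ only in the artifact name, an equivalent loop can save typing (a shell sketch, assuming the jars are in the current directory):

    # Install each TinkerPop 3.3.0-SNAPSHOT jar into the local Maven repository
    for artifact in gremlin-core gremlin-groovy gremlin-shaded hadoop-gremlin spark-gremlin tinkergraph-gremlin; do
      mvn install:install-file -Dfile="${artifact}-3.3.0-SNAPSHOT.jar" \
        -DgroupId=org.apache.tinkerpop -DartifactId="$artifact" \
        -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    done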
    

    c. Update tinkerpop.version in azure-documentdb-spark/pom.xml to 3.3.0-SNAPSHOT.

    d. Build with Maven. The required jars are placed in target and target/alternateLocation:

    git clone https://github.com/Azure/azure-cosmosdb-spark.git
    cd azure-cosmosdb-spark
    mvn clean package
    
  3. Copy the previously mentioned jars to a local directory at ~/azure-documentdb-spark:

    # From the connector build directory
    mkdir ~/azure-documentdb-spark
    cp target/azure-documentdb-spark-0.0.3-SNAPSHOT.jar ~/azure-documentdb-spark
    cp target/alternateLocation/azure-documentdb-1.10.0.jar ~/azure-documentdb-spark
    

Distribute the dependencies to the Spark worker nodes

Because the transformation of graph data depends on TinkerPop3, you must distribute the related dependencies to all Spark worker nodes: the Gremlin dependencies listed earlier, the Cosmos DB Spark connector jar, and the Cosmos DB Java SDK. Do the following:

  1. Copy all the jars into ~/azure-documentdb-spark:

    # From /home/sshuser/tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone
    cp lib/* ~/azure-documentdb-spark
    

  2. Get the list of all Spark worker nodes. You can find it in the Ambari dashboard, in the Spark2 Clients list in the Spark2 section.

  3. Copy that directory to each of the nodes:

    scp -r ~/azure-documentdb-spark sshuser@wn0-cosmos:/home/sshuser
    scp -r ~/azure-documentdb-spark sshuser@wn1-cosmos:/home/sshuser
    ...
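
    With many worker nodes, a loop is handy (a sketch; wn0-cosmos and wn1-cosmos are the example host names above, so substitute the names from your Ambari list):

    # Copy the dependency directory to every worker node
    for host in wn0-cosmos wn1-cosmos; do
      scp -r ~/azure-documentdb-spark sshuser@$host:/home/sshuser
    done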
    

Set up the environment variables

  1. Find the HDP version of the Spark cluster. It is the directory name under /usr/hdp/ (for example, 2.5.4.2-7).
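
    For example, list the directory on the master node (the version shown is from this walkthrough; yours will differ):

    ls /usr/hdp
    # 2.5.4.2-7  current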

  2. Set hdp.version for all nodes. In the Ambari dashboard, go to YARN > Configs > Advanced, and then do the following:

    a. In Custom yarn-site, add a new property hdp.version with the value of the HDP version on the master node.

    b. Save the configuration. You can ignore any warnings.

    c. Restart the YARN and Oozie services when the notification icons prompt you to.

  3. Set the following environment variables on the master node (replace the values as appropriate):

    export HADOOP_GREMLIN_LIBS=/home/sshuser/tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone/ext/spark-gremlin/lib
    export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR:/usr/hdp/current/spark2-client/jars/*:/home/sshuser/azure-documentdb-spark/*
    export HDP_VERSION=2.5.4.2-7
    export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
    

Prepare the graph configuration

  1. Create a configuration file with the Azure Cosmos DB connection parameters and Spark settings, and put it at tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone/conf/hadoop/gremlin-spark.properties.

    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.jarsInDistributedCache=true
    gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
    
    gremlin.hadoop.graphReader=com.microsoft.azure.documentdb.spark.gremlin.DocumentDBInputRDD
    gremlin.hadoop.graphWriter=com.microsoft.azure.documentdb.spark.gremlin.DocumentDBOutputRDD
    
    ####################################
    # SparkGraphComputer Configuration #
    ####################################
    spark.master=yarn
    spark.executor.memory=3g
    spark.executor.instances=6
    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
    gremlin.spark.persistContext=true
    
    # Classpath for the driver and executors
    spark.driver.extraClassPath=/usr/hdp/current/spark2-client/jars/*:/home/sshuser/azure-documentdb-spark/*
    spark.executor.extraClassPath=/usr/hdp/current/spark2-client/jars/*:/home/sshuser/azure-documentdb-spark/*
    
    ######################################
    # DocumentDB Spark connector         #
    ######################################
    spark.documentdb.connectionMode=Gateway
    spark.documentdb.schema_samplingratio=1.0
    spark.documentdb.Endpoint=https://FILLIN.documents.azure.com:443/
    spark.documentdb.Masterkey=FILLIN
    spark.documentdb.Database=FILLIN
    spark.documentdb.Collection=FILLIN
    spark.documentdb.preferredRegions=FILLIN
    
  2. Update spark.driver.extraClassPath and spark.executor.extraClassPath to include the directory of the jars that you distributed in the previous step; in this case, /home/sshuser/azure-documentdb-spark/*.

  3. Provide the following details for Azure Cosmos DB:

    spark.documentdb.Endpoint=https://FILLIN.documents.azure.com:443/
    spark.documentdb.Masterkey=FILLIN
    spark.documentdb.Database=FILLIN
    spark.documentdb.Collection=FILLIN
    # Optional
    #spark.documentdb.preferredRegions=West\ US;West\ US\ 2
    

Load the TinkerPop graph, and save it to Azure Cosmos DB

To demonstrate how to persist a graph into Azure Cosmos DB, this example uses the predefined TinkerPop modern graph. The graph is stored in Kryo format, and it's provided in the TinkerPop repository.

  1. Because you are running Gremlin in YARN mode, you must make the graph data available in the Hadoop file system. Use the following commands to make a directory and copy the local graph file into it.

    # From the tinkerpop directory
    hadoop fs -mkdir /graphData
    hadoop fs -copyFromLocal ~/tinkerpop/data/tinkerpop-modern.kryo /graphData/tinkerpop-modern.kryo
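
    You can confirm that the file is in place before you continue (an optional check):

    hadoop fs -ls /graphData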
    
  2. Temporarily update the gremlin-spark.properties file to use GryoInputFormat to read the graph, and set gremlin.hadoop.inputLocation to the graph file that you just copied:

    gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
    gremlin.hadoop.inputLocation=/graphData/tinkerpop-modern.kryo
    
  3. Start the Gremlin console, and then run the following steps to persist the data to the configured Azure Cosmos DB collection:

    a. Create the graph: graph = GraphFactory.open("conf/hadoop/gremlin-spark.properties").

    b. Use SparkGraphComputer to write the graph, as shown in the following console session:

    gremlin> graph = GraphFactory.open("conf/hadoop/gremlin-spark.properties")
    ==>hadoopgraph[gryoinputformat->documentdboutputrdd]
    gremlin> hg = graph.
                compute(SparkGraphComputer.class).
                result(GraphComputer.ResultGraph.NEW).
                persist(GraphComputer.Persist.EDGES).
                program(TraversalVertexProgram.build().
                    traversal(graph.traversal().withComputer(Computer.compute(SparkGraphComputer.class)), "gremlin-groovy", "g.V()").
                    create(graph)).
                submit().
                get() 
    ==>result[hadoopgraph[documentdbinputrdd->documentdboutputrdd],memory[size:1]]
    
  4. From Data Explorer, you can verify that the data has been persisted to Azure Cosmos DB.
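
    If Data Explorer lets you run Gremlin queries against your graph, a quick sanity check is to count the vertices; for the modern graph this should return 6 (assuming the write above succeeded):

    g.V().count()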

Load the graph from Azure Cosmos DB, and run Gremlin queries

  1. To load the graph, edit gremlin-spark.properties to set graphReader to DocumentDBInputRDD:

    gremlin.hadoop.graphReader=com.microsoft.azure.documentdb.spark.gremlin.DocumentDBInputRDD
    
  2. Load the graph, traverse the data, and run Gremlin queries with it by doing the following:

    a. Start the Gremlin console: bin/gremlin.sh.

    b. Create the graph from the configuration: graph = GraphFactory.open('conf/hadoop/gremlin-spark.properties').

    c. Create a graph traversal source that uses SparkGraphComputer: g = graph.traversal().withComputer(SparkGraphComputer).

    d. Run the following Gremlin graph queries:

    gremlin> graph = GraphFactory.open("conf/hadoop/gremlin-spark.properties")
    ==>hadoopgraph[documentdbinputrdd->documentdboutputrdd]
    gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
    ==>graphtraversalsource[hadoopgraph[documentdbinputrdd->documentdboutputrdd], sparkgraphcomputer]
    gremlin> g.V().count()
    ==>6
    gremlin> g.E().count()
    ==>6
    gremlin> g.V(1).out().values('name')
    ==>josh
    ==>vadas
    ==>lop
    gremlin> g.V().hasLabel('person').coalesce(values('nickname'), values('name'))
    ==>josh
    ==>peter
    ==>vadas
    ==>marko
    gremlin> g.V().hasLabel('person').
            choose(values('name')).
                option('marko', values('age')).
                option('josh', values('name')).
                option('vadas', valueMap()).
                option('peter', label())
    ==>josh
    ==>person
    ==>[name:[vadas],age:[27]]
    ==>29
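
    As one more example, a reducing traversal such as groupCount also runs through SparkGraphComputer. For the modern graph, counting vertices by label returns four person and two software vertices (a sample query that's not part of the original walkthrough; key order can vary):

    gremlin> g.V().groupCount().by(label)
    ==>[software:2,person:4]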
    
Note

To see more detailed logging, raise the log level in conf/log4j-console.properties to a more verbose setting.
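
For example, the shipped file typically sets the root logger to WARN; raising it to INFO produces more output (the exact contents of the file can vary by version):

    # conf/log4j-console.properties
    log4j.rootLogger=INFO, stdout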

Next steps

In this quick-start article, you've learned how to work with graphs by combining Azure Cosmos DB and Spark.