Azure Cosmos DB: Perform graph analytics by using Spark and Apache TinkerPop Gremlin

Azure Cosmos DB is the globally distributed, multi-model database service from Microsoft. You can create and query document, key/value, and graph databases, all of which benefit from the global-distribution and horizontal-scale capabilities at the core of Azure Cosmos DB. Azure Cosmos DB supports online transaction processing (OLTP) graph workloads that use Apache TinkerPop Gremlin.

Spark is an Apache Software Foundation project that's focused on general-purpose online analytical processing (OLAP). Spark provides a hybrid in-memory/disk-based distributed computing model that is similar to the Hadoop MapReduce model. You can deploy Apache Spark in the cloud by using Azure HDInsight.

By combining Azure Cosmos DB and Spark, you can perform both OLTP and OLAP workloads when you use Gremlin. This quick-start article demonstrates how to run Gremlin queries against Azure Cosmos DB on an Azure HDInsight Spark cluster.


Before you can run this sample, you must have the following prerequisites:

  • Azure HDInsight Spark cluster 2.0
  • JDK 1.8+ (If you don't have JDK, run apt-get install default-jdk.)
  • Maven (If you don't have Maven, run apt-get install maven.)
  • An Azure subscription (If you don't have an Azure subscription, create a free account before you begin.)

For information about how to set up an Azure HDInsight Spark cluster, see Provisioning HDInsight clusters.
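
Before you continue, you might want to confirm that the command-line tools used in this article are available on the master node. The following is a minimal sketch, not part of the original sample: the function name check_prereqs is hypothetical, and the tool list in the example (java, mvn, git, ssh) is an assumption based on the commands that appear later in this article.

```shell
#!/bin/sh
# Report whether each required command-line tool is on PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
# check_prereqs is a hypothetical helper name used only for illustration.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example for the tools that this article relies on:
#   check_prereqs java mvn git ssh
```

If a tool is reported as missing, install it (for example, with the apt-get commands listed in the prerequisites) before you build TinkerPop.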

Create an Azure Cosmos DB database account

First, create a database account with the Graph API by doing the following:

  1. In a new browser window, sign in to the Azure portal.

  2. Click Create a resource > Databases > Azure Cosmos DB.

  3. In the New account page, enter the settings for the new Azure Cosmos DB account.

    Setting | Suggested value | Description
    ID | Enter a unique name | Enter a unique name to identify this Azure Cosmos DB account. Because the ID that you provide is used to create your URI, use a unique but identifiable ID. The ID can contain only lowercase letters, numbers, and the hyphen (-) character, and it must contain 3 to 50 characters.
    API | Gremlin (graph) | The API determines the type of account to create. Azure Cosmos DB provides five APIs to suit the needs of your application: SQL (document database), Gremlin (graph database), MongoDB (document database), Azure Table, and Cassandra, each of which currently requires a separate account. Select Gremlin (graph) because in this quickstart you are creating a graph that is queryable using Gremlin syntax. Learn more about the Graph API.
    Subscription | Your subscription | Select the Azure subscription that you want to use for this Azure Cosmos DB account.
    Resource Group | Create new, then enter the same unique name as provided above in ID | Select Create New, then enter a new resource-group name for your account. For simplicity, you can use the same name as your ID.
    Location | Select the region closest to your users | Select the geographic location in which to host your Azure Cosmos DB account. Use the location that's closest to your users to give them the fastest access to the data.
    Enable geo-redundancy | Leave blank | This creates a replicated version of your database in a second (paired) region. Leave this blank.
    Pin to dashboard | Select | Select this box so that your new database account is added to your portal dashboard for easy access.

    Then click Create.

  4. The account creation takes a few minutes. Wait for the portal to display the Congratulations! Your Azure Cosmos DB account was created page.

Add a collection

You can now use the Data Explorer tool in the Azure portal to create a graph database.

  1. Click Data Explorer > New Graph.

    The Add Graph area is displayed on the far right. You might need to scroll right to see it.

  2. In the Add graph page, enter the settings for the new graph.

    Setting | Suggested value | Description
    Database ID | sample-database | Enter sample-database as the name for the new database. Database names must be between 1 and 255 characters, and cannot contain / \ # ? or a trailing space.
    Graph ID | sample-graph | Enter sample-graph as the name for your new collection. Graph names have the same character requirements as database IDs.
    Storage Capacity | Fixed (10 GB) | Leave the default value of Fixed (10 GB). This value is the storage capacity of the database.
    Throughput | 400 RUs | Change the throughput to 400 request units per second (RU/s). If you want to reduce latency, you can scale up the throughput later.
  3. Once the form is filled out, click OK.

Get Apache TinkerPop

Get Apache TinkerPop by doing the following:

  1. Connect remotely to the master node of the HDInsight cluster by using ssh.

  2. Clone the TinkerPop3 source code, build it locally, and install it to the local Maven cache.

    git clone
    cd tinkerpop
    mvn clean install
  3. Install the Spark-Gremlin plug-in

    a. The installation of the plug-in is handled by Grape. Populate the repositories information for Grape so it can download the plug-in and its dependencies.

    Create the grape configuration file if it's not present at ~/.groovy/grapeConfig.xml. Use the following settings:

    <ivysettings>
      <settings defaultResolver="downloadGrapes"/>
      <resolvers>
        <chain name="downloadGrapes">
          <filesystem name="cachedGrapes">
            <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
            <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision].[ext]"/>
          </filesystem>
          <!-- The remote root values are empty in this copy; supply each repository URL before use. -->
          <ibiblio name="codehaus" root="" m2compatible="true"/>
          <ibiblio name="central" root="" m2compatible="true"/>
          <ibiblio name="jitpack" root="" m2compatible="true"/>
          <ibiblio name="java.net2" root="" m2compatible="true"/>
          <ibiblio name="apache-snapshots" root="" m2compatible="true"/>
          <ibiblio name="local" root="file:${user.home}/.m2/repository/" m2compatible="true"/>
        </chain>
      </resolvers>
    </ivysettings>

    b. Start Gremlin console bin/

    c. Install the Spark-Gremlin plug-in with version 3.3.0-SNAPSHOT, which you built in the previous steps:

    $ bin/
            (o o)
    plugin activated: tinkerpop.server
    plugin activated: tinkerpop.utilities
    plugin activated: tinkerpop.tinkergraph
    gremlin> :install org.apache.tinkerpop spark-gremlin 3.3.0-SNAPSHOT
    ==>loaded: [org.apache.tinkerpop, spark-gremlin, 3.3.0-SNAPSHOT] - restart the console to use [tinkerpop.spark]
    gremlin> :q
    $ bin/
            (o o)
    plugin activated: tinkerpop.server
    plugin activated: tinkerpop.utilities
    plugin activated: tinkerpop.tinkergraph
    gremlin> :plugin use tinkerpop.spark
    ==>tinkerpop.spark activated
  4. Check whether the Hadoop-Gremlin plug-in is activated by running :plugin list. If it is, disable it by running :plugin unuse tinkerpop.hadoop, because it could interfere with the Spark-Gremlin plug-in.

Prepare TinkerPop3 dependencies

When you built TinkerPop3 in the previous step, the process also pulled all jar dependencies for Spark and Hadoop in the target directory. Use the jars that are pre-installed with HDI, and pull in additional dependencies only as necessary.

  1. Go to the Gremlin Console target directory at tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone.

  2. Move all jars under ext/ to lib/: find ext/ -name '*.jar' -exec mv {} lib/ \;.

  3. Remove all jar libraries under lib/ that are not in the following list:

    # TinkerPop3
    # Gremlin dependencies
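
The cleanup in step 3 can be sketched as a dry run that only lists the jars that would be removed. This is an illustration under stated assumptions, not the article's own procedure: the function name list_jars_to_remove is hypothetical, and it expects a keep-list file that you create yourself, with one jar file name per line.

```shell
#!/bin/sh
# List jars under a lib directory whose file names are NOT in a keep-list file
# (one jar file name per line). Prints the paths only; it deletes nothing.
# list_jars_to_remove and the keep-list file are hypothetical, for illustration.
list_jars_to_remove() {
    lib_dir=$1
    keep_file=$2
    for jar in "$lib_dir"/*.jar; do
        [ -e "$jar" ] || continue    # no jars matched: the glob stayed literal
        name=$(basename "$jar")
        # -x matches the whole line, -F treats the name as a fixed string
        if ! grep -qxF -- "$name" "$keep_file"; then
            echo "$jar"
        fi
    done
}

# Example, run from the Gremlin Console target directory:
#   list_jars_to_remove lib keep-list.txt
```

Review the printed paths before you delete anything. Because the function never removes files itself, it is safe to rerun after you adjust the keep-list.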

Get the Azure Cosmos DB Spark connector

  1. Get the Azure Cosmos DB Spark connector azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar and Cosmos DB Java SDK azure-documentdb-1.12.0.jar from Azure Cosmos DB Spark Connector on GitHub.

  2. Alternatively, you can build it locally. Because the latest version of Spark-Gremlin was built with Spark 1.6.1 and is not compatible with Spark 2.0.2, which is currently used in the Azure Cosmos DB Spark connector, you can build the latest TinkerPop3 code and install the jars manually. Do the following:

    a. Clone the Azure Cosmos DB Spark connector.

    b. Build TinkerPop3 (already done in previous steps). Install all TinkerPop 3.3.0-SNAPSHOT jars locally.

    mvn install:install-file -Dfile="gremlin-core-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-core -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="gremlin-groovy-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-groovy -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="gremlin-shaded-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=gremlin-shaded -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="hadoop-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=hadoop-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="spark-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=spark-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar
    mvn install:install-file -Dfile="tinkergraph-gremlin-3.3.0-SNAPSHOT.jar" -DgroupId=org.apache.tinkerpop -DartifactId=tinkergraph-gremlin -Dversion=3.3.0-SNAPSHOT -Dpackaging=jar

    c. Update tinkerpop.version in azure-documentdb-spark/pom.xml to 3.3.0-SNAPSHOT.

    d. Build with Maven. The needed jars are placed in target and target/alternateLocation.

    git clone
    cd azure-documentdb-spark
    mvn clean package
  3. Copy the previously mentioned jars to a local directory at ~/azure-documentdb-spark:

    $ azure-documentdb-spark:
    mkdir ~/azure-documentdb-spark
    cp target/azure-documentdb-spark-0.0.3-SNAPSHOT.jar ~/azure-documentdb-spark
    cp target/alternateLocation/azure-documentdb-1.10.0.jar ~/azure-documentdb-spark
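
The six mvn install:install-file commands in step 2b differ only in the artifact name, so they can be generated with a loop. The following sketch assumes a hypothetical helper named print_install_cmds. It only prints the commands so that you can review them first; the artifact names and the group ID are the ones listed in step 2b.

```shell
#!/bin/sh
# Print one "mvn install:install-file" command per TinkerPop artifact.
# The commands are only printed, so they can be reviewed before being run.
# print_install_cmds is a hypothetical helper name used only for illustration.
print_install_cmds() {
    version=$1
    for artifact in gremlin-core gremlin-groovy gremlin-shaded \
                    hadoop-gremlin spark-gremlin tinkergraph-gremlin; do
        echo "mvn install:install-file -Dfile=\"${artifact}-${version}.jar\"" \
             "-DgroupId=org.apache.tinkerpop -DartifactId=${artifact}" \
             "-Dversion=${version} -Dpackaging=jar"
    done
}

# Example:
#   print_install_cmds 3.3.0-SNAPSHOT
```

To run the generated commands, you could pipe the output to sh from the directory that contains the jars, for example print_install_cmds 3.3.0-SNAPSHOT | sh.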

Distribute the dependencies to the Spark worker nodes

  1. Because the transformation of graph data depends on TinkerPop3, you must distribute the related dependencies to all Spark worker nodes.

  2. Copy the previously mentioned Gremlin dependencies, the Azure Cosmos DB Spark connector jar, and the Azure Cosmos DB Java SDK to the worker nodes by doing the following:

    a. Copy all the jars into ~/azure-documentdb-spark.

    $ /home/sshuser/tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone:
    cp lib/* ~/azure-documentdb-spark

    b. Get the list of all Spark worker nodes, which you can find on Ambari Dashboard, in the Spark2 Clients list in the Spark2 section.

    c. Copy that directory to each of the nodes.

    scp -r ~/azure-documentdb-spark sshuser@wn0-cosmos:/home/sshuser
    scp -r ~/azure-documentdb-spark sshuser@wn1-cosmos:/home/sshuser
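
If the cluster has more than a couple of worker nodes, the copy in step c can be generated with a loop. This is a minimal sketch with a hypothetical helper name, print_scp_cmds. It assumes that the target directory on each node is /home/<user>, as in the two commands above, and it prints the scp commands instead of running them.

```shell
#!/bin/sh
# Print one scp command per worker node; nothing is copied.
# print_scp_cmds is a hypothetical helper name used only for illustration.
# Assumption: the target directory on every node is /home/<user>.
print_scp_cmds() {
    src_dir=$1
    user=$2
    shift 2
    for host in "$@"; do
        echo "scp -r $src_dir $user@$host:/home/$user"
    done
}

# Example with the two worker nodes shown above:
#   print_scp_cmds ~/azure-documentdb-spark sshuser wn0-cosmos wn1-cosmos
```

Replace the host names with the worker nodes that Ambari lists for your cluster.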

Set up the environment variables

  1. Find the HDP version of the Spark cluster. It is the directory name under /usr/hdp/.

  2. Set hdp.version for all nodes. In Ambari Dashboard, go to YARN section > Configs > Advanced, and then do the following:

    a. In Custom yarn-site, add a new property hdp.version with the value of the HDP version on the master node.

    b. Save the configurations. There are warnings, which you can ignore.

    c. Restart the YARN and Oozie services as the notification icons indicate.

  3. Set the following environment variables on the master node (replace the values as appropriate):

    export HADOOP_GREMLIN_LIBS=/home/sshuser/tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone/ext/spark-gremlin/lib
    export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR:/usr/hdp/current/spark2-client/jars/*:/home/sshuser/azure-documentdb-spark/*
    export HDP_VERSION=
    export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
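
Step 1 can be sketched as a small helper that reads the version directory name instead of typing it by hand. This is an illustration only: the function name detect_hdp_version is hypothetical, and it assumes that /usr/hdp/ contains a current directory plus one directory per installed HDP version. If several versions are installed, it prints the first one in glob order, which might not be the one you want.

```shell
#!/bin/sh
# Print the name of the first directory under a base directory (default
# /usr/hdp), skipping the "current" entry. Returns non-zero if none is found.
# detect_hdp_version is a hypothetical helper name used only for illustration.
detect_hdp_version() {
    base_dir=${1:-/usr/hdp}
    for entry in "$base_dir"/*/; do
        [ -d "$entry" ] || continue      # no match: the glob stayed literal
        name=$(basename "$entry")
        [ "$name" = "current" ] && continue
        echo "$name"
        return 0
    done
    return 1
}

# Example:
#   export HDP_VERSION=$(detect_hdp_version)
```

Check the printed value against the directory listing before you use it for the hdp.version property in Ambari.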

Prepare the graph configuration

  1. Create a configuration file with the Azure Cosmos DB connection parameters and Spark settings, and put it at tinkerpop/gremlin-console/target/apache-tinkerpop-gremlin-console-3.3.0-SNAPSHOT-standalone/conf/hadoop/

    # SparkGraphComputer Configuration #
    # Classpath for the driver and executors
    # DocumentDB Spark connector         #
  2. Update the spark.driver.extraClassPath and spark.executor.extraClassPath to include the directory of the jars that you distributed in the previous step, in this case /home/sshuser/azure-documentdb-spark/*.

  3. Provide the following details for Azure Cosmos DB:

    # Optional
    #spark.documentdb.preferredRegions=West\ US;West\ US\ 2
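
A missing classpath entry in step 2 is easy to overlook, so you might want to check the configuration file before you start the Gremlin Console. The following sketch assumes a hypothetical helper named check_classpath_props. It looks only at the two Spark properties named in step 2 and does not validate any other setting.

```shell
#!/bin/sh
# Check that both Spark classpath properties in a properties file mention
# the given jar directory. Prints one line per property and returns
# non-zero if either property does not contain the directory.
# check_classpath_props is a hypothetical helper name used for illustration.
check_classpath_props() {
    props_file=$1
    jar_dir=$2
    status=0
    for key in spark.driver.extraClassPath spark.executor.extraClassPath; do
        # Use the value of the last uncommented "key=value" line for this key.
        value=$(grep "^${key}=" "$props_file" | tail -n 1 | cut -d= -f2-)
        case $value in
            *"$jar_dir"*) echo "ok: $key" ;;
            *)            echo "missing $jar_dir in: $key"; status=1 ;;
        esac
    done
    return "$status"
}

# Example (replace <config-file> with the path of your configuration file):
#   check_classpath_props <config-file> '/home/sshuser/azure-documentdb-spark/*'
```

Quote the jar directory in single quotes, as in the example, so that the shell doesn't expand the trailing asterisk.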

Load the TinkerPop graph, and save it to Azure Cosmos DB

To demonstrate how to persist a graph into Azure Cosmos DB, this example uses the predefined TinkerPop Modern graph. The graph is stored in Kryo format, and it's provided in the TinkerPop repository.

  1. Because you are running Gremlin in YARN mode, you must make the graph data available in the Hadoop file system. Use the following commands to make a directory and copy the local graph file into it.

    $ tinkerpop:
    hadoop fs -mkdir /graphData
    hadoop fs -copyFromLocal ~/tinkerpop/data/tinkerpop-modern.kryo /graphData/tinkerpop-modern.kryo
  2. Temporarily update the configuration file to use GryoInputFormat to read the graph. Also set inputLocation to the directory that you created, as in the following:
  3. Start Gremlin Console, and then create the following computation steps to persist data to the configured Azure Cosmos DB collection:

    a. Create the graph graph ="conf/hadoop/").

    b. Use SparkGraphComputer for writing graph.compute(SparkGraphComputer.class).result(GraphComputer.ResultGraph.NEW).persist(GraphComputer.Persist.EDGES).program(,"gremlin-groovy","g.V()").create(graph)).submit().get().

    gremlin> graph ="conf/hadoop/")
    gremlin> hg = graph.
                    traversal(graph.traversal().withComputer(Computer.compute(SparkGraphComputer.class)), "gremlin-groovy", "g.V()").
  4. From Data Explorer, you can verify that the data has been persisted to Azure Cosmos DB.
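
The temporary change in step 2 amounts to setting two keys in the configuration file. One way to script that is sketched below with a hypothetical helper named set_prop. The keys in the example, gremlin.hadoop.graphReader and gremlin.hadoop.inputLocation, are the standard Hadoop-Gremlin property names, and it is an assumption that they are the ones this sample's configuration file uses. The input location in the example is the file that was copied in step 1.

```shell
#!/bin/sh
# Set key=value in a Java-style properties file: replace existing
# "key=..." lines if there are any, otherwise append a new line.
# set_prop is a hypothetical helper name used only for illustration.
# Note: awk -v interprets backslash escapes, so avoid backslashes in values.
set_prop() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp)
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 { print k "=" v; seen = 1; next }
        { print }
        END { if (!seen) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (replace <config-file> with the path of your configuration file):
#   set_prop <config-file> gremlin.hadoop.graphReader \
#       org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
#   set_prop <config-file> gremlin.hadoop.inputLocation \
#       /graphData/tinkerpop-modern.kryo
```

Because the change is temporary, keep a copy of the original file so that you can restore the reader setting before you load the graph from Azure Cosmos DB.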

Load the graph from Azure Cosmos DB, and run Gremlin queries

  1. To load the graph, edit the configuration file to set graphReader to DocumentDBInputRDD:
  2. Load the graph, traverse the data, and run Gremlin queries with it by doing the following:

    a. Start the Gremlin Console bin/

    b. Create the graph with the configuration graph ='conf/hadoop/').

    c. Create a graph traversal with SparkGraphComputer g = graph.traversal().withComputer(SparkGraphComputer).

    d. Run the following Gremlin graph queries:

    gremlin> graph ="conf/hadoop/")
    gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
    ==>graphtraversalsource[hadoopgraph[documentdbinputrdd->documentdboutputrdd], sparkgraphcomputer]
    gremlin> g.V().count()
    gremlin> g.E().count()
    gremlin> g.V(1).out().values('name')
    gremlin> g.V().hasLabel('person').coalesce(values('nickname'), values('name'))
    gremlin> g.V().hasLabel('person').
                choose(values('name')).
                option('marko', values('age')).
                option('josh', values('name')).
                option('vadas', valueMap()).
                option('peter', label())


To see more detailed logging, set the log level in conf/ to a more verbose level.

Next steps

In this quick-start article, you've learned how to work with graphs by combining Azure Cosmos DB and Spark.