Azure Cosmos DB for Gremlin graph support and compatibility with TinkerPop features

04/04/2023

APPLIES TO: Gremlin

Azure Cosmos DB supports Apache TinkerPop's graph traversal language, known as Gremlin. You can use the Gremlin language to create graph entities (vertices and edges), modify properties within those entities, perform queries and traversals, and delete entities.

The Azure Cosmos DB graph engine closely follows the Apache TinkerPop traversal steps specification, but some implementation details are specific to Azure Cosmos DB. This article provides a quick walkthrough of Gremlin and lists the Gremlin features that the API for Gremlin supports.

Compatible client libraries

The following table shows popular Gremlin drivers that you can use against Azure Cosmos DB:

Download	Source	Getting Started	Supported/Recommended connector version
.NET	Gremlin.NET on GitHub	Create Graph using .NET	3.4.13
Java	Gremlin JavaDoc	Create Graph using Java	3.4.13
Python	Gremlin-Python on GitHub	Create Graph using Python	3.4.13
Gremlin console	TinkerPop docs	Create Graph using Gremlin Console	3.4.13
Node.js	Gremlin-JavaScript on GitHub	Create Graph using Node.js	3.4.13
PHP	Gremlin-PHP on GitHub	Create Graph using PHP	3.1.0
Go Lang	Go Lang		This library is built by external contributors. The Azure Cosmos DB team doesn't offer any support or maintain the library.

Note

Gremlin client driver versions 3.5.* and 3.6.* have known compatibility issues, so we recommend using the latest supported 3.4.* driver versions listed above. This table will be updated when the compatibility issues have been addressed for these newer driver versions.

Supported Graph Objects

TinkerPop is a standard that covers a wide range of graph technologies. Therefore, it has standard terminology to describe the features that a graph provider offers. Azure Cosmos DB provides a persistent, high-concurrency, writable graph database that can be partitioned across multiple servers or clusters.

The following table lists the TinkerPop features that are implemented by Azure Cosmos DB:

Category	Azure Cosmos DB implementation	Notes
Graph features	Provides Persistence and ConcurrentAccess. Designed to support Transactions	Computer methods can be implemented via the Spark connector.
Variable features	Supports Boolean, Integer, Byte, Double, Float, Long, String	Supports primitive types, is compatible with complex types via data model
Vertex features	Supports RemoveVertices, MetaProperties, AddVertices, MultiProperties, StringIds, UserSuppliedIds, AddProperty, RemoveProperty	Supports creating, modifying, and deleting vertices
Vertex property features	StringIds, UserSuppliedIds, AddProperty, RemoveProperty, BooleanValues, ByteValues, DoubleValues, FloatValues, IntegerValues, LongValues, StringValues	Supports creating, modifying, and deleting vertex properties
Edge features	AddEdges, RemoveEdges, StringIds, UserSuppliedIds, AddProperty, RemoveProperty	Supports creating, modifying, and deleting edges
Edge property features	Properties, BooleanValues, ByteValues, DoubleValues, FloatValues, IntegerValues, LongValues, StringValues	Supports creating, modifying, and deleting edge properties

Gremlin wire format

Azure Cosmos DB currently uses the JSON format when returning results from Gremlin operations. For example, the following snippet shows a JSON representation of a vertex returned to the client from Azure Cosmos DB:

  {
    "id": "a7111ba7-0ea1-43c9-b6b2-efc5e3aea4c0",
    "label": "person",
    "type": "vertex",
    "outE": {
      "knows": [
        {
          "id": "3ee53a60-c561-4c5e-9a9f-9c7924bc9aef",
          "inV": "04779300-1c8e-489d-9493-50fd1325a658"
        },
        {
          "id": "21984248-ee9e-43a8-a7f6-30642bc14609",
          "inV": "a8e3e741-2ef7-4c01-b7c8-199f8e43e3bc"
        }
      ]
    },
    "properties": {
      "firstName": [
        {
          "value": "Thomas"
        }
      ],
      "lastName": [
        {
          "value": "Andersen"
        }
      ],
      "age": [
        {
          "value": 45
        }
      ]
    }
  }

The properties used by the JSON format for vertices are described below:

Property	Description
`id`	The ID for the vertex. Must be unique (in combination with the value of `_partition` if applicable). If no value is provided, a GUID is generated automatically.
`label`	The label of the vertex. This property is used to describe the entity type.
`type`	Used to distinguish vertices from non-graph documents
`properties`	Bag of user-defined properties associated with the vertex. Each property can have multiple values.
`_partition`	The partition key of the vertex. Used for graph partitioning.
`outE`	This property contains a list of out edges from a vertex. Storing the adjacency information with vertex allows for fast execution of traversals. Edges are grouped based on their labels.

Each property can store multiple values within an array.

Property	Description
`value`	The value of the property

Each edge contains the following information to help with navigation to other parts of the graph:

Property	Description
`id`	The ID for the edge. Must be unique (in combination with the value of `_partition` if applicable)
`label`	The label of the edge. This property is optional, and used to describe the relationship type.
`inV`	This property contains a list of in vertices for an edge. Storing the adjacency information with the edge allows for fast execution of traversals. Vertices are grouped based on their labels.
`properties`	Bag of user-defined properties associated with the edge.
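As a minimal sketch of how a client might read this format, the following Python snippet parses a shortened copy of the vertex shown earlier and pulls out its property values and out edges. The helper names `property_values` and `out_edges` are hypothetical and not part of any SDK; the snippet only assumes the field layout described in the tables above.

```python
import json

# Shortened copy of the example vertex above (one "knows" edge kept).
VERTEX_JSON = """
{
  "id": "a7111ba7-0ea1-43c9-b6b2-efc5e3aea4c0",
  "label": "person",
  "type": "vertex",
  "outE": {
    "knows": [
      {
        "id": "3ee53a60-c561-4c5e-9a9f-9c7924bc9aef",
        "inV": "04779300-1c8e-489d-9493-50fd1325a658"
      }
    ]
  },
  "properties": {
    "firstName": [{"value": "Thomas"}],
    "lastName": [{"value": "Andersen"}],
    "age": [{"value": 45}]
  }
}
"""


def property_values(vertex):
    """Map each property name to the list of values stored for it."""
    return {
        name: [entry["value"] for entry in entries]
        for name, entries in vertex.get("properties", {}).items()
    }


def out_edges(vertex):
    """Return (edge label, edge id, target vertex id) for every out edge."""
    return [
        (label, edge["id"], edge["inV"])
        for label, edges in vertex.get("outE", {}).items()
        for edge in edges
    ]


vertex = json.loads(VERTEX_JSON)
print(property_values(vertex))
print(out_edges(vertex))
```

Because every property is stored as an array of `value` entries, the sketch keeps a list per property name rather than collapsing to a single value.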

Gremlin steps

Now let's look at the Gremlin steps supported by Azure Cosmos DB. For a complete reference on Gremlin, see the TinkerPop reference.

Step	Description	TinkerPop 3.2 Documentation
`addE`	Adds an edge between two vertices	addE step
`addV`	Adds a vertex to the graph	addV step
`and`	Ensures that all the traversals return a value	and step
`as`	A step modulator to assign a variable to the output of a step	as step
`by`	A step modulator used with `group` and `order`	by step
`coalesce`	Returns the first traversal that returns a result	coalesce step
`constant`	Returns a constant value. Used with `coalesce`	constant step
`count`	Returns the count from the traversal	count step
`dedup`	Returns the values with the duplicates removed	dedup step
`drop`	Drops the values (vertex/edge)	drop step
`executionProfile`	Creates a description of all operations generated by the executed Gremlin step	executionProfile step
`fold`	Acts as a barrier that computes the aggregate of results	fold step
`group`	Groups the values based on the labels specified	group step
`has`	Used to filter properties, vertices, and edges. Supports `hasLabel`, `hasId`, `hasNot`, and `has` variants.	has step
`inject`	Inject values into a stream	inject step
`is`	Used to perform a filter using a boolean expression	is step
`limit`	Used to limit number of items in the traversal	limit step
`local`	Local wraps a section of a traversal, similar to a subquery	local step
`not`	Used to produce the negation of a filter	not step
`optional`	Returns the result of the specified traversal if it yields a result else it returns the calling element	optional step
`or`	Ensures at least one of the traversals returns a value	or step
`order`	Returns results in the specified sort order	order step
`path`	Returns the full path of the traversal	path step
`project`	Projects the properties as a Map	project step
`properties`	Returns the properties for the specified labels	properties step
`range`	Filters to the specified range of values	range step
`repeat`	Repeats the step for the specified number of times. Used for looping	repeat step
`sample`	Used to sample results from the traversal	sample step
`select`	Used to project results from the traversal	select step
`store`	Used for non-blocking aggregates from the traversal	store step
`TextP.startingWith(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property with the beginning of a given string	TextP predicates
`TextP.endingWith(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property with the ending of a given string	TextP predicates
`TextP.containing(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property with the contents of a given string	TextP predicates
`TextP.notStartingWith(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property that doesn't start with a given string	TextP predicates
`TextP.notEndingWith(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property that doesn't end with a given string	TextP predicates
`TextP.notContaining(string)`	String filtering function. This function is used as a predicate for the `has()` step to match a property that doesn't contain a given string	TextP predicates
`tree`	Aggregate paths from a vertex into a tree	tree step
`unfold`	Unroll an iterator as a step	unfold step
`union`	Merge results from multiple traversals	union step
`V`	Includes the steps necessary for traversals between vertices and edges: `V`, `E`, `out`, `in`, `both`, `outE`, `inE`, `bothE`, `outV`, `inV`, `bothV`, and `otherV`	vertex steps
`where`	Used to filter results from the traversal. Supports `eq`, `neq`, `lt`, `lte`, `gt`, `gte`, and `between` operators	where step
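To make the `TextP` predicates in the table concrete, here is a small local model in Python. It isn't how the service evaluates queries; it only mimics the matching rule of each predicate over plain dictionaries, assuming case-sensitive comparison as in TinkerPop. The `has_text` helper and the sample data are hypothetical.

```python
# Local stand-ins for the TextP predicates (assumption: case-sensitive matching).
TEXT_PREDICATES = {
    "startingWith": lambda value, s: value.startswith(s),
    "endingWith": lambda value, s: value.endswith(s),
    "containing": lambda value, s: s in value,
    "notStartingWith": lambda value, s: not value.startswith(s),
    "notEndingWith": lambda value, s: not value.endswith(s),
    "notContaining": lambda value, s: s not in value,
}


def has_text(vertices, key, predicate, s):
    """Mimic g.V().has(key, TextP.<predicate>(s)) over plain dictionaries."""
    test = TEXT_PREDICATES[predicate]
    # Like has(), vertices that lack the property are filtered out.
    return [v for v in vertices if key in v and test(v[key], s)]


people = [
    {"id": "1", "firstName": "Thomas"},
    {"id": "2", "firstName": "Mary"},
    {"id": "3", "firstName": "Tom"},
]

# Corresponds to: g.V().has('firstName', TextP.startingWith('T'))
print([p["id"] for p in has_text(people, "firstName", "startingWith", "T")])

# Corresponds to: g.V().has('firstName', TextP.notContaining('om'))
print([p["id"] for p in has_text(people, "firstName", "notContaining", "om")])
```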

The write-optimized engine provided by Azure Cosmos DB supports automatic indexing of all properties within vertices and edges by default. Therefore, queries with filters, range queries, sorting, or aggregates on any property are processed from the index, and served efficiently. For more information on how indexing works in Azure Cosmos DB, see our paper on schema-agnostic indexing.

Behavior differences

The Azure Cosmos DB graph engine runs breadth-first traversal, whereas TinkerPop Gremlin is depth-first. This behavior achieves better performance in a horizontally scalable system like Azure Cosmos DB.
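The difference between the two strategies can be illustrated with a toy example. The sketch below is not the engine's implementation; it walks a small hypothetical adjacency list in Python to show that breadth-first and depth-first traversal reach the same vertices in a different order.

```python
from collections import deque

# Hypothetical graph: a -> b, a -> c, b -> d, c -> e
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}


def breadth_first(graph, start):
    """Visit vertices level by level, nearest neighbors first."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for neighbor in graph[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def depth_first(graph, start):
    """Follow each branch to its end before moving to the next branch."""
    order, stack, seen = [], [start], set()
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        order.append(vertex)
        # Push neighbors in reverse so the first neighbor is visited first.
        stack.extend(reversed(graph[vertex]))
    return order


print(breadth_first(GRAPH, "a"))  # ['a', 'b', 'c', 'd', 'e']
print(depth_first(GRAPH, "a"))    # ['a', 'b', 'd', 'c', 'e']
```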

Unsupported features

Gremlin Bytecode is a programming language agnostic specification for graph traversals. Azure Cosmos DB Graph doesn't support it yet. Use GremlinClient.SubmitAsync() and pass the traversal as a text string.
property(set, 'xyz', 1) set cardinality isn't supported today. Use property(list, 'xyz', 1) instead. To learn more, see Vertex properties with TinkerPop.
The match() step isn't currently available. This step provides declarative querying capabilities.
Objects as properties on vertices or edges aren't supported. Properties can only be primitive types or arrays.
Sorting by array properties order().by(<array property>) isn't supported. Sorting is supported only by primitive types.
Non-primitive JSON types aren't supported. Use string, number, or true/false types. null values aren't supported.
The GraphSONv3 serializer isn't currently supported. Use the GraphSONv2 Serializer, Reader, and Writer classes in the connection configuration. The results returned by Azure Cosmos DB for Gremlin don't have the same format as the GraphSON format.
Lambda expressions and functions aren't currently supported. This includes the .map{<expression>}, the .by{<expression>}, and the .filter{<expression>} functions. To learn more, and to learn how to rewrite them using Gremlin steps, see A Note on Lambdas.
Transactions aren't supported because of the distributed nature of the system. Configure an appropriate consistency model on the Gremlin account to "read your own writes" and use optimistic concurrency to resolve conflicting writes.
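Several of the points above (text queries instead of bytecode, `list` cardinality instead of `set`, and the GraphSONv2 serializer) can be combined in one sketch. The example uses the gremlinpython driver, whose `Client.submit` plays the role that `GremlinClient.SubmitAsync()` plays in .NET. The endpoint, database, graph, and key arguments, the `pk` partition key property, and the helper names are placeholders and assumptions, not values from this article.

```python
# Gremlin text query: plain string, primitive values, list cardinality.
# Assumption: the graph's partition key property is named 'pk'.
ADD_VERTEX_QUERY = (
    "g.addV('person')"
    ".property('id', 'thomas')"
    ".property('pk', 'thomas')"
    ".property(list, 'xyz', 1)"
)


def make_client(endpoint, database, graph, key):
    """Create a driver client configured with the GraphSONv2 serializer."""
    # Imported lazily so the rest of the sketch runs without the driver installed.
    from gremlin_python.driver import client, serializer

    return client.Client(
        endpoint,  # for example 'wss://<account>.gremlin.cosmos.azure.com:443/'
        "g",
        username=f"/dbs/{database}/colls/{graph}",
        password=key,
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )


def run_text_query(gremlin_client, query):
    """Submit the traversal as a text string and wait for all results."""
    return gremlin_client.submit(query).all().result()
```

A caller would create the client once with `make_client(...)` and then pass `ADD_VERTEX_QUERY` to `run_text_query`. No network call happens until `run_text_query` is invoked.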

Known limitations

Index utilization for Gremlin queries with mid-traversal .V() steps: Currently, only the first .V() call of a traversal will make use of the index to resolve any filters or predicates attached to it. Subsequent calls will not consult the index, which might increase the latency and cost of the query.

Assuming default indexing, a typical read Gremlin query that starts with the .V() step would use parameters in its attached filtering steps, such as .has() or .where(), to optimize the cost and performance of the query. For example:

g.V().has('category', 'A')

However, when more than one .V() step is included in the Gremlin query, the resolution of the data for the query might not be optimal. Take the following query as an example:

g.V().has('category', 'A').as('a').V().has('category', 'B').as('b').select('a', 'b')

This query will return two groups of vertices based on their property called category. In this case, only the first call, g.V().has('category', 'A') will make use of the index to resolve the vertices based on the values of their properties.

A workaround for this query is to use subtraversal steps such as .map() and .union(), as in the following examples:

// Query workaround using .map()
g.V().has('category', 'A').as('a').map(__.V().has('category', 'B')).as('b').select('a','b')

// Query workaround using .union()
g.V().has('category', 'A').fold().union(unfold(), __.V().has('category', 'B'))

You can review the performance of the queries by using the Gremlin executionProfile() step.
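One way to compare the original query with its workarounds is to append `executionProfile()` to each query text before submitting it. The sketch below only builds the query strings; the `with_profile` helper is hypothetical, and sending the strings to the service is left to whichever driver you use.

```python
# The three queries discussed above, keyed by a short description.
QUERIES = {
    "mid-traversal V()": (
        "g.V().has('category', 'A').as('a')"
        ".V().has('category', 'B').as('b').select('a', 'b')"
    ),
    "map() workaround": (
        "g.V().has('category', 'A').as('a')"
        ".map(__.V().has('category', 'B')).as('b').select('a','b')"
    ),
    "union() workaround": (
        "g.V().has('category', 'A').fold()"
        ".union(unfold(), __.V().has('category', 'B'))"
    ),
}


def with_profile(query):
    """Append executionProfile() as the last step of a text query."""
    return query.rstrip() + ".executionProfile()"


for name, query in QUERIES.items():
    print(f"{name}: {with_profile(query)}")
```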

Next steps

Get started building a graph application using our SDKs
Learn more about graph support in Azure Cosmos DB