Access Azure Cosmos DB Cassandra API data from Azure Databricks

APPLIES TO: Cassandra API

This article details how to work with Azure Cosmos DB Cassandra API from Spark on Azure Databricks.

Prerequisites

Add the required dependencies

  • Cassandra Spark connector: To integrate Azure Cosmos DB Cassandra API with Spark, attach the Cassandra connector to your Azure Databricks cluster. To attach the connector:

    • Review the Databricks runtime version and its Spark version, then find the Maven coordinates of a compatible Cassandra Spark connector and attach it to the cluster. See the "Upload a Maven package or Spark package" article for instructions on attaching the connector library to the cluster. We recommend Databricks runtime version 7.5, which supports Spark 3.0. To add the Apache Spark Cassandra connector to your cluster, select Libraries > Install New > Maven, and then add com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 in Maven coordinates. If you are using Spark 2.x, we recommend an environment with Spark version 2.4.5, using the Spark connector at Maven coordinates com.datastax.spark:spark-cassandra-connector_2.11:2.4.3.
  • Azure Cosmos DB Cassandra API-specific library: If you are using Spark 2.x, a custom connection factory is required to configure the retry policy from the Cassandra Spark connector to Azure Cosmos DB Cassandra API. Add the com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0 Maven coordinates to attach the library to the cluster.

Note

If you are using Spark 3.0, you do not need to install the Cosmos DB Cassandra API-specific library mentioned above.

Warning

The Spark 3 samples shown in this article have been tested with Spark version 3.0.1 and the corresponding Cassandra Spark Connector com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0. Later versions of Spark and/or the Cassandra connector may not function as expected.
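Once the connector is attached, the cluster must be pointed at your Cassandra API account. As a minimal sketch, run in a Databricks notebook (where the `spark` session is predefined), the connection is configured through Spark settings like the following; the account name and key are placeholders you replace with your own values, and the host/port/SSL values follow the Cassandra API endpoint convention:

```scala
// Run in a Databricks notebook, where `spark` (SparkSession) is predefined.
// Replace <cosmos-account> and <cosmos-account-key> with your own values.
spark.conf.set("spark.cassandra.connection.host", "<cosmos-account>.cassandra.cosmos.azure.com")
spark.conf.set("spark.cassandra.connection.port", "10350")   // Cassandra API uses port 10350
spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
spark.conf.set("spark.cassandra.auth.username", "<cosmos-account>")
spark.conf.set("spark.cassandra.auth.password", "<cosmos-account-key>")
```

With Spark 2.x and the helper library above, you would additionally set the Cosmos DB-specific connection factory so its retry policy is used; with Spark 3.0 this step is not needed.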

Sample notebooks

A list of Azure Databricks sample notebooks is available in the GitHub repo for you to download. These samples show how to connect to Azure Cosmos DB Cassandra API from Spark and perform different CRUD operations on the data. You can also import all the notebooks into your Databricks cluster workspace and run them.
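The notebooks cover CRUD operations end to end. As a minimal sketch of a write followed by a read, assuming the connection settings are already configured and a keyspace `books_ks` with a table `books` (illustrative names) already exists in the account:

```scala
// Run in a Databricks notebook, where `spark` is predefined.
// Keyspace `books_ks` and table `books` are assumed to already exist.
import spark.implicits._

val booksDF = Seq(
  ("b00001", "Arthur Conan Doyle", "A Study in Scarlet", 1887)
).toDF("book_id", "book_author", "book_name", "book_pub_year")

// Append the rows to the Cassandra API table
booksDF.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .mode("append")
  .save()

// Read the table back into a DataFrame
val readDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()

readDF.show()
```

Note that the Cassandra connector requires append mode for writes; overwriting an existing table needs additional options.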

Accessing Azure Cosmos DB Cassandra API from Spark Scala programs

Spark programs to be run as automated processes on Azure Databricks are submitted to the cluster by using spark-submit and scheduled to run through Azure Databricks jobs.

The following links help you get started with building Spark Scala programs that interact with Azure Cosmos DB Cassandra API.
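For a standalone program submitted via spark-submit, the session is built explicitly rather than provided by the notebook. A minimal skeleton, with illustrative object, keyspace, and table names and placeholder credentials, might look like:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative job skeleton; connection settings mirror the notebook configuration.
object CassandraApiJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cosmos-cassandra-api-job")
      .config("spark.cassandra.connection.host", "<cosmos-account>.cassandra.cosmos.azure.com")
      .config("spark.cassandra.connection.port", "10350")
      .config("spark.cassandra.connection.ssl.enabled", "true")
      .config("spark.cassandra.auth.username", "<cosmos-account>")
      .config("spark.cassandra.auth.password", "<cosmos-account-key>")
      .getOrCreate()

    // Example action: count rows in an existing table (placeholder names)
    val count = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "books_ks", "table" -> "books"))
      .load()
      .count()
    println(s"Row count: $count")

    spark.stop()
  }
}
```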

Next steps

Get started with creating a Cassandra API account, database, and a table by using a Java application.