Access Azure Cosmos DB for Apache Cassandra data from Azure Databricks

APPLIES TO: Cassandra

This article details how to work with Azure Cosmos DB for Apache Cassandra from Spark on Azure Databricks.

Prerequisites

Add the required dependencies

  • Cassandra Spark connector: To integrate Azure Cosmos DB for Apache Cassandra with Spark, attach the Cassandra connector to the Azure Databricks cluster. To attach the connector:

    • Review the Databricks runtime version and the Spark version it supports. Then find the Maven coordinates of a Cassandra Spark connector release that is compatible with that Spark version, and attach it to the cluster. See the "Upload a Maven package or Spark package" article for how to attach the connector library to the cluster. We recommend selecting Databricks runtime version 10.4 LTS, which supports Spark 3.2.1. To add the Apache Spark Cassandra Connector to your cluster, select Libraries > Install New > Maven, and then enter com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0 in Maven coordinates. If you're using Spark 2.x, we recommend an environment with Spark version 2.4.5 and the Spark connector at Maven coordinates com.datastax.spark:spark-cassandra-connector_2.11:2.4.3.
  • Azure Cosmos DB for Apache Cassandra-specific library: If you're using Spark 2.x, a custom connection factory is required to configure the retry policy from the Cassandra Spark connector to Azure Cosmos DB for Apache Cassandra. Add the com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0 Maven coordinates to attach the library to the cluster.

Note

If you are using Spark 3.x, you do not need to install the Azure Cosmos DB for Apache Cassandra-specific library mentioned above.

Warning

The Spark 3 samples shown in this article have been tested with Spark version 3.2.1 and the corresponding Cassandra Spark Connector com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Later versions of Spark and/or the Cassandra connector may not function as expected.
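
As a quick sanity check before attaching a library, the version pairing described above can be sketched in plain Scala. The `ConnectorCheck` object and its `looksCompatible` rule are hypothetical helpers written for this illustration, not part of Databricks or the connector. The rule assumes that a connector matches a cluster when its Scala suffix equals the cluster's Scala binary version and its major.minor version equals Spark's, which holds for both pairs recommended in this article but is only a heuristic.

```scala
// Hypothetical helper for illustration only; not a Databricks or connector API.
object ConnectorCheck {

  final case class MavenCoordinate(group: String, artifact: String, version: String) {
    // Scala artifacts conventionally end in "_<scala binary version>", e.g. "_2.12".
    def scalaBinaryVersion: Option[String] = {
      val idx = artifact.lastIndexOf('_')
      if (idx >= 0 && idx < artifact.length - 1) Some(artifact.substring(idx + 1)) else None
    }

    // First two components of the version, e.g. "3.2.0" -> "3.2".
    def majorMinor: String = version.split('.').take(2).mkString(".")
  }

  object MavenCoordinate {
    // Accepts the "group:artifact:version" form used in the Databricks library dialog.
    def parse(s: String): Option[MavenCoordinate] = s.trim.split(':') match {
      case Array(g, a, v) if g.nonEmpty && a.nonEmpty && v.nonEmpty =>
        Some(MavenCoordinate(g, a, v))
      case _ => None
    }
  }

  // Assumed heuristic: same Scala binary version and same major.minor as Spark.
  def looksCompatible(coord: MavenCoordinate, sparkVersion: String, scalaBinary: String): Boolean =
    coord.scalaBinaryVersion.contains(scalaBinary) &&
      coord.majorMinor == sparkVersion.split('.').take(2).mkString(".")
}
```

For example, parsing com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0 and checking it against Spark 3.2.1 with Scala 2.12 passes this check, while checking it against Spark 2.4.5 with Scala 2.11 does not. A passing check is not a guarantee of compatibility; the connector's own documentation remains the authority.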

Sample notebooks

A list of Azure Databricks sample notebooks is available in a GitHub repo for you to download. These samples show how to connect to Azure Cosmos DB for Apache Cassandra from Spark and perform different CRUD operations on the data. You can also import all the notebooks into your Databricks workspace and run them.
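
To give a rough idea of what the connection step in such a notebook involves, here is a minimal sketch for Spark 3.x. It is not taken from the sample notebooks. The property keys are standard Spark Cassandra Connector settings. The `CosmosCassandraConf` object and the account, keyspace, and table names are hypothetical placeholders. The endpoint pattern `<account>.cassandra.cosmos.azure.com` and port 10350 are assumptions based on the usual Azure Cosmos DB for Apache Cassandra contact point; confirm them against your account's connection string.

```scala
// Hypothetical helper for illustration; values must come from your own account.
object CosmosCassandraConf {

  // Connector settings for the cluster's Spark session.
  // Assumption: contact point "<account>.cassandra.cosmos.azure.com" on port 10350 with TLS.
  def connectorSettings(account: String, password: String): Map[String, String] = Map(
    "spark.cassandra.connection.host"        -> s"$account.cassandra.cosmos.azure.com",
    "spark.cassandra.connection.port"        -> "10350",
    "spark.cassandra.connection.ssl.enabled" -> "true",
    "spark.cassandra.auth.username"          -> account,
    "spark.cassandra.auth.password"          -> password
  )

  // Options identifying one table for the connector's DataFrame source.
  def tableOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)
}

// In a notebook where `spark` is the active SparkSession, the settings could be
// applied as follows (left as comments so the sketch compiles without Spark):
//
//   CosmosCassandraConf.connectorSettings("myaccount", "<primary-key>")
//     .foreach { case (k, v) => spark.conf.set(k, v) }
//
//   val df = spark.read
//     .format("org.apache.spark.sql.cassandra")
//     .options(CosmosCassandraConf.tableOptions("my_keyspace", "my_table"))
//     .load()
```

Keeping the settings in a plain map makes them easy to review and to reuse across notebooks. In practice the password would come from a secret store rather than a literal string.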

Accessing Azure Cosmos DB for Apache Cassandra from Spark Scala programs

Spark programs that run as automated processes on Azure Databricks are submitted to the cluster by using spark-submit and scheduled to run through Azure Databricks jobs.

The following are links to help you get started building Spark Scala programs to interact with Azure Cosmos DB for Apache Cassandra.

Next steps

Get started with creating an API for Cassandra account, database, and table by using a Java application.