Access Azure Cosmos DB for Apache Cassandra data from Azure Databricks

10/12/2022

APPLIES TO: Cassandra

This article details how to work with Azure Cosmos DB for Apache Cassandra from Spark on Azure Databricks.

Prerequisites

- Provision an Azure Cosmos DB for Apache Cassandra account
- Review the basics of connecting to Azure Cosmos DB for Apache Cassandra
- Provision an Azure Databricks cluster
- Review the code samples for working with the API for Cassandra
- Use cqlsh for validation, if you prefer

API for Cassandra instance configuration for the Cassandra connector

The connector for the API for Cassandra requires the Cassandra connection details to be initialized as part of the Spark context. When you launch a Databricks notebook, the Spark context is already initialized, and it isn't advisable to stop and reinitialize it. One solution is to add the API for Cassandra instance configuration at the cluster level, in the cluster Spark configuration. It's a one-time activity per cluster. Add the following lines to the Spark configuration as space-separated key-value pairs:
```
spark.cassandra.connection.host YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com
spark.cassandra.connection.port 10350
spark.cassandra.connection.ssl.enabled true
spark.cassandra.auth.username YOUR_COSMOSDB_ACCOUNT_NAME
spark.cassandra.auth.password YOUR_COSMOSDB_KEY
```
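
As a minimal sketch, the five settings above can be generated from an account name and key so that the host and username stay consistent. The function name `cosmos_cassandra_spark_conf` is hypothetical and used only for illustration; the host pattern, port, and property keys are copied from the block above.

```python
def cosmos_cassandra_spark_conf(account_name: str, account_key: str) -> str:
    """Render the cluster Spark configuration as space-separated key-value lines."""
    if not account_name or not account_key:
        raise ValueError("account name and key are both required")
    # Keys and fixed values mirror the configuration block in this article.
    settings = {
        "spark.cassandra.connection.host": f"{account_name}.cassandra.cosmosdb.azure.com",
        "spark.cassandra.connection.port": "10350",
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.auth.username": account_name,
        "spark.cassandra.auth.password": account_key,
    }
    # One "key value" pair per line, which is the shape the cluster setting expects.
    return "\n".join(f"{key} {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Placeholder values; substitute your own account name and key.
    print(cosmos_cassandra_spark_conf("YOUR_COSMOSDB_ACCOUNT_NAME", "YOUR_COSMOSDB_KEY"))
```

The printed text can be pasted into the cluster's Spark configuration. Avoid committing a real account key to source control.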

Add the required dependencies

Cassandra Spark connector: To integrate Azure Cosmos DB for Apache Cassandra with Spark, attach the Cassandra connector to the Azure Databricks cluster. To attach the connector:
- Review the Databricks runtime version and the Spark version. Then find the Maven coordinates of a compatible Cassandra Spark connector, and attach it to the cluster. See the "Upload a Maven package or Spark package" article for how to attach the connector library to the cluster. We recommend selecting Databricks runtime version 10.4 LTS, which supports Spark 3.2.1. To add the Apache Spark Cassandra Connector to your cluster, select Libraries > Install New > Maven, and then enter com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0 in Maven coordinates. If you're using Spark 2.x, we recommend an environment with Spark version 2.4.5, using the Spark connector at Maven coordinates com.datastax.spark:spark-cassandra-connector_2.11:2.4.3.
Azure Cosmos DB for Apache Cassandra-specific library: If you're using Spark 2.x, a custom connection factory is required to configure the retry policy from the Cassandra Spark connector to Azure Cosmos DB for Apache Cassandra. Add the com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0 Maven coordinates to attach the library to the cluster.

Note

If you are using Spark 3.x, you do not need to install the Azure Cosmos DB for Apache Cassandra-specific library mentioned above.

Warning

The Spark 3 samples shown in this article have been tested with Spark version 3.2.1 and the corresponding Cassandra Spark Connector com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Later versions of Spark and/or the Cassandra connector may not function as expected.
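
The version guidance in this section can be summarized as a small lookup. This is only a sketch: the function name `connector_coordinates` is hypothetical, and it covers only the combinations this article names (Spark 3.2.1 and Spark 2.4.5), not every runtime.

```python
from typing import List


def connector_coordinates(spark_version: str) -> List[str]:
    """Return the Maven coordinates this article recommends for a Spark version."""
    major = int(spark_version.split(".")[0])
    if major == 3:
        # Spark 3.x: only the connector assembly is needed.
        return ["com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0"]
    if major == 2:
        # Spark 2.x: the connector plus the Azure Cosmos DB helper library.
        return [
            "com.datastax.spark:spark-cassandra-connector_2.11:2.4.3",
            "com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0",
        ]
    raise ValueError(f"no recommendation in this article for Spark {spark_version}")


if __name__ == "__main__":
    for version in ("3.2.1", "2.4.5"):
        print(version, connector_coordinates(version))
```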

Sample notebooks

A list of Azure Databricks sample notebooks is available in a GitHub repo for you to download. These samples show how to connect to Azure Cosmos DB for Apache Cassandra from Spark and perform different CRUD operations on the data. You can also import all the notebooks into your Databricks workspace and run them.
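
For orientation, a read and an append through the Spark DataFrame API might look like the following PySpark sketch. It is not taken from the sample notebooks. It assumes the connector library is attached to the cluster and the cluster-level Spark configuration described earlier is in place; the helper names and the keyspace and table names (`books_ks`, `books`) are hypothetical.

```python
# Data source name registered by the Cassandra Spark connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_table(spark, keyspace: str, table: str):
    """Load a table as a DataFrame, using an existing SparkSession."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def append_rows(df, keyspace: str, table: str) -> None:
    """Append the rows of a DataFrame to an existing table."""
    (
        df.write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

In a Databricks notebook, where `spark` is already defined, `read_table(spark, "books_ks", "books")` returns a DataFrame, and `append_rows(df, "books_ks", "books")` writes one back. Connection settings come from the cluster configuration, so the helpers carry no credentials.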

Accessing Azure Cosmos DB for Apache Cassandra from Spark Scala programs

Spark programs that run as automated processes on Azure Databricks are submitted to the cluster by using spark-submit and scheduled to run through Azure Databricks jobs.

The following are links to help you get started building Spark Scala programs to interact with Azure Cosmos DB for Apache Cassandra.

Next steps

Get started with creating an API for Cassandra account, database, and table by using a Java application.