Exercise - Create and populate your Azure Cosmos DB collections
In this unit, you'll create an Azure Cosmos DB account and use a console application to populate the database.
Create your database account
A database account is a container for multiple Azure Cosmos DB databases.
Choose a unique name for your database account. The name must be unique across all Azure Cosmos DB instances. Run the following command to generate a random database account name using the Bash $RANDOM variable and store it in an environment variable to use later.
export COSMOS_NAME=cosmos$RANDOM
A sandbox resource group has been created for you. Run the following command to have it stored in an environment variable that you'll use for the rest of the code samples in this exercise.
export RESOURCE_GROUP=$(az group list | jq -r '.[0].name')
Note
If you were using your own Azure account instead of the sandbox, you would set this variable to the fixed name of your own resource group.
Create an Azure Cosmos DB account by running the following command.
az cosmosdb create \
    --resource-group $RESOURCE_GROUP \
    --name $COSMOS_NAME
The database account can take up to 10 minutes to provision. You can continue reading this unit while the account is being created.
Azure Cosmos DB concepts
This section introduces the following Azure Cosmos DB concepts:
- Resources
- Partitioning
- Indexing
Resources
An Azure Cosmos DB account is a container for one or more databases. An Azure Cosmos DB database is a container for one or more collections. A collection contains documents. A document is an unstructured set of key/value pairs, read and written in JSON format.
Partitioning
Partitioning is the distribution and grouping of your data across the underlying resources. Documents are grouped in a partition based on the value of the partition key. You specify the partition key when you create the collection. To better understand the concept of partition keys, let's review the properties and values in the following JSON example document.
{
    "OrderTime": "4:21 PM",
    "id": "e152c6b5-2d9b-f232-6f58-de17190ecfec",
    "OrderStatus": "NEW",
    "Item": {
        "id": "52841410-7500-828d-e932-66f166e6f87f",
        "Title": "8vlyf0jyvexfv3f",
        "Category": "Books",
        "UPC": "74:8585:249",
        "Website": "https://pzr.khdftcp.com",
        "ReleaseDate": "2019-01-10T16:21:41.039088-08:00",
        "Condition": "NEW",
        "Merchant": "24/7",
        "ListPricePerItem": 13.41,
        "PurchasePrice": 12.76,
        "Currency": "USD"
    },
    "Quantity": 24,
    "PaymentInstrumentType": 1,
    "PurchaseOrderNumber": "422-40277-87",
    "Customer": {
        "id": "297b2be8-f31f-cd84-4013-5a9a13a8aae6",
        "FirstName": "Clair",
        "LastName": "Weber",
        "Email": "Clair.Weber@yahoo.com",
        "StreetAddress": "594 Stoltenberg Divide",
        "ZipCode": "93267-7740",
        "State": "VT"
    },
    "ShippingDate": "2019-01-16T16:21:41.65195-08:00",
    "Data": "2FgYt+9u0FiL4Q=="
}
Any of these properties, or a combination of them, can be a partition key. For example, if you define the partition key as a combination of the properties Category and Merchant, all documents that have matching values for both Category and Merchant are grouped in the same partition.
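The grouping described above can be sketched in a few lines of Python. This is only an illustration of how documents with matching key values end up together, not how Azure Cosmos DB implements partitioning. The helper names (get_path, group_by_partition_key) and the trimmed-down order documents are hypothetical.

```python
from collections import defaultdict

def get_path(document, path):
    """Follow a path such as "/Item/Category" into a nested dict."""
    value = document
    for part in path.strip("/").split("/"):
        value = value[part]
    return value

def group_by_partition_key(documents, key_paths):
    """Group document ids by the tuple of values found at key_paths."""
    partitions = defaultdict(list)
    for doc in documents:
        key = tuple(get_path(doc, p) for p in key_paths)
        partitions[key].append(doc["id"])
    return dict(partitions)

# Hypothetical, trimmed-down order documents (not real data).
orders = [
    {"id": "order-1", "Item": {"Category": "Books", "Merchant": "24/7"}},
    {"id": "order-2", "Item": {"Category": "Books", "Merchant": "24/7"}},
    {"id": "order-3", "Item": {"Category": "Games", "Merchant": "24/7"}},
]

groups = group_by_partition_key(orders, ["/Item/Category", "/Item/Merchant"])
for key, ids in groups.items():
    print(key, ids)
```

The two Books orders share both key values, so they land in the same group, while the Games order gets a group of its own.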
An effective partitioning strategy distributes data and access evenly across partitions and across time. Querying documents from within the same partition is less expensive than querying across partitions.
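One rough way to judge how evenly a candidate key spreads data is to look at the share of documents that fall under its most common value. The following Python sketch assumes a small made-up workload in which most orders are for books. The function name largest_partition_share is hypothetical, and the sketch ignores access patterns over time.

```python
from collections import Counter

def largest_partition_share(partition_key_values):
    """Return the fraction of documents that share the most common key value."""
    counts = Counter(partition_key_values)
    return max(counts.values()) / len(partition_key_values)

# Hypothetical workload: 8 orders, 6 of them for books.
categories = ["Books"] * 6 + ["Games", "Music"]
# Hypothetical product identifiers, one distinct value per order.
item_ids = [f"item-{n}" for n in range(8)]

print(largest_partition_share(categories))  # 0.75
print(largest_partition_share(item_ids))    # 0.125
```

A share close to 1 suggests that one partition would receive most of the data, while a small share suggests a more even spread.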
You can choose how to partition your data at design time. The partitioning configuration can't be changed after a collection is provisioned.
We examine partitioning concepts and examples in detail in subsequent units.
Indexing
An index is a catalog of document properties and their values. For each property value, it links to the documents that contain that value. Indexing makes searching a collection more efficient. However, this search efficiency comes at the cost of the resources required to insert or change a document: whenever a document is inserted or changed, Azure Cosmos DB has to update the index. The optimal indexing strategy for your collection depends on your workload.
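The trade-off can be made concrete with a toy index in Python. This is a minimal sketch of the general idea (a map from property values to document ids), not a description of the index structure Azure Cosmos DB actually uses. The PropertyIndex class and the sample documents are hypothetical.

```python
from collections import defaultdict

class PropertyIndex:
    """Toy index: maps (property, value) pairs to the ids of matching documents."""

    def __init__(self):
        self.entries = defaultdict(set)
        self.updates = 0  # counts index writes caused by inserts

    def insert(self, document):
        # Every indexed property costs one extra index write per insert.
        for prop, value in document.items():
            if prop == "id":
                continue
            self.entries[(prop, value)].add(document["id"])
            self.updates += 1

    def find(self, prop, value):
        # A lookup reads one index entry instead of scanning every document.
        return sorted(self.entries.get((prop, value), set()))

index = PropertyIndex()
index.insert({"id": "order-1", "OrderStatus": "NEW", "Quantity": 24})
index.insert({"id": "order-2", "OrderStatus": "NEW", "Quantity": 3})

print(index.find("OrderStatus", "NEW"))
print(index.updates)
```

Each insert touches one index entry per property, which is why indexing fewer properties makes writes cheaper and lookups on the unindexed properties slower.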
Unlike partitioning, you can change indexing at runtime.
We'll look at indexing in subsequent units.
Set environment variables for endpoint and keys
After the database account is created, run the following command to store its endpoint in an environment variable.
export ENDPOINT=$(az cosmosdb list \
    --resource-group $RESOURCE_GROUP \
    --output tsv \
    --query "[0].documentEndpoint")
Run the following command to store the access key in an environment variable.
export KEY=$(az cosmosdb keys list \
    --resource-group $RESOURCE_GROUP \
    --name $COSMOS_NAME \
    --output tsv \
    --query primaryMasterKey)
Create your database and collections
Run the following command to create a database called mslearn in your Azure Cosmos DB account. We need only one database for these exercises.
az cosmosdb sql database create \
    --resource-group $RESOURCE_GROUP \
    --account-name $COSMOS_NAME \
    --name mslearn
We're going to create three collections to compare different partitioning strategies and workloads.
Create the first collection by running the following command.
We'll allocate less capacity to this collection to demonstrate overloading it: it's configured for 400 request units per second (RU/s), which is less than the next two collections. The partition key for this collection is the unique identifier of the order. In this case, the choice of partition key isn't important, because the collection is smaller than a single partition.
az cosmosdb sql container create \
    --resource-group $RESOURCE_GROUP \
    --account-name $COSMOS_NAME \
    --database-name mslearn \
    --name Small \
    --partition-key-path /id \
    --throughput 400
Create the second collection by running the following command.
This collection uses an order item's product category as the partition key. We'll explore the consequences of this choice as we go through the exercises in this module. This second collection is configured for 7000 RU/s, which is more than the first collection.
az cosmosdb sql container create \
    --resource-group $RESOURCE_GROUP \
    --account-name $COSMOS_NAME \
    --database-name mslearn \
    --name HotPartition \
    --partition-key-path /Item/Category \
    --throughput 7000
Create the third collection by running the following command.
This collection partitions the documents by the order item's unique product identifier. This last collection is also configured for 7000 RU/s.
az cosmosdb sql container create \
    --resource-group $RESOURCE_GROUP \
    --account-name $COSMOS_NAME \
    --database-name mslearn \
    --name Orders \
    --partition-key-path /Item/id \
    --throughput 7000
Populate your collections
We'll use an open-source C# console application to populate your collections. This application generates random order documents and inserts them into your collections. We'll also use this application in subsequent units to query the collections.
Clone the console application repository from GitHub. Run the following command in the sandbox environment.
git clone https://github.com/MicrosoftDocs/mslearn-monitor-azure-cosmos-db
Change into the application's directory by running the following command.
cd mslearn-monitor-azure-cosmos-db/ExerciseCosmosDB
Check your environment variables. The console application needs the environment variables to connect to the database. If Azure Cloud Shell times out, you need to set these and the COSMOS_NAME variable again. You can reset your COSMOS_NAME, RESOURCE_GROUP, ENDPOINT, and KEY variables by running the following commands.
export COSMOS_NAME=$(az cosmosdb list --output tsv --query "[0].name")
export RESOURCE_GROUP=$(az group list | jq -r '.[0].name')
export ENDPOINT=$(az cosmosdb list \
    --resource-group $RESOURCE_GROUP \
    --output tsv \
    --query "[0].documentEndpoint")
export KEY=$(az cosmosdb keys list \
    --resource-group $RESOURCE_GROUP \
    --name $COSMOS_NAME \
    --output tsv \
    --query primaryMasterKey)
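If you want a quick way to confirm that none of the four variables is empty before starting the application, a small check like the following might help. It's a sketch in Python (which isn't part of the sample application) and assumes only the variable names used in this exercise. The function name missing_variables is hypothetical.

```python
import os

# The four variable names used in this exercise.
REQUIRED = ["COSMOS_NAME", "RESOURCE_GROUP", "ENDPOINT", "KEY"]

def missing_variables(environment, required=REQUIRED):
    """Return the required names that are unset or empty in the given mapping."""
    return [name for name in required if not environment.get(name)]

missing = missing_variables(os.environ)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required variables are set.")
```

The check reports a variable as missing when it's unset or set to an empty string, which is what happens when one of the az commands returns nothing.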
Populate the Small collection by running the following command.
dotnet run -- -c Small -o InsertDocument -n 4000 -p 10
The application takes a few minutes to run. We need to populate the database with enough data that we can discern metrics for different partitioning and indexing strategies. To populate this collection, the console application is running with the following options:
Option  Value           Description
-c      Small           Name of the collection to use.
-o      InsertDocument  Name of the task to run.
-n      4000            Number of times to run.
-p      10              Degree of parallelism to use. That's the number of threads used for the experiment. The higher this number, the greater the demand on the collection.
The first time you run the application, it shows a welcome message.
You can see the other options for this application by running dotnet run -- --help.
While the console application runs, you see one line printed per second that shows the status and RUs needed for the database writes.
Populate the HotPartition collection by running the following command.
dotnet run -- -c HotPartition -o InsertDocument -n 20000 -p 10
Populate the Orders collection by running the following command.
dotnet run -- -c Orders -o InsertDocument -n 20000 -p 10
Notice that the throughput differs for each collection. The data populates the Small collection at a slower rate than the other collections because it was configured to use 400 RU/s, whereas the HotPartition and Orders collections were configured for 7000 RU/s.
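To see why provisioned throughput bounds the insert rate, a back-of-the-envelope estimate can help. The Python sketch below assumes a cost of 10 RUs per write purely for illustration. The real cost depends on document size and indexing, and you can read it from the application's output. The function name minimum_seconds is hypothetical.

```python
def minimum_seconds(document_count, ru_per_write, provisioned_ru_per_second):
    """Lower bound on load time if provisioned throughput is the only limit."""
    return document_count * ru_per_write / provisioned_ru_per_second

# Assumed for illustration only; not a measured value.
RU_PER_WRITE = 10

# Small: 4000 documents at 400 RU/s.
print(minimum_seconds(4000, RU_PER_WRITE, 400))    # 100.0
# HotPartition or Orders: 20000 documents at 7000 RU/s.
print(minimum_seconds(20000, RU_PER_WRITE, 7000))
```

Under this assumption, the Small collection needs longer to take in 4000 documents than the larger collections need for 20000, because its RU/s budget is much smaller. The estimate ignores network latency, client parallelism, and uneven load across partitions.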