AdamPliska-1173 asked · NavtejSaini-MSFT commented

Bulk update 100 million+ Cosmos DB records

I have a large collection of data stored in Cosmos DB, roughly 100 million records, though the count fluctuates over time. I would like to change the 'schema' of the data to support better reporting. In this case I need to add just a single field, which will be calculated from values already on each record.

First, I wrote a small C# console application to load a record, update it, and save it back to Cosmos DB. I was using batch updates, and the throughput was poor.

Second, I wrote a Cosmos DB stored procedure that takes the partition key and the record id, loads the record, makes the update, and saves it. This increased my throughput, but not enough: with 200k RUs provisioned I was still looking at over a week of running.

Third, I modified the stored procedure to grab 500 records from the given partition, update them, save them, and return. My C# app calls this procedure over and over, passing a partition key for a range of records that have not yet been updated. I am still provisioned at 200k RUs; metrics say I'm maxing that out, and I'm hitting perhaps 2,000 records per second, i.e. four calls to the stored procedure. The cost of one call that updates a full 500 records is about 7k RUs. Right now it's reporting that it will finish tomorrow at about 4am.

My question is: is there a better way to do this sort of transformation? Should I create a new collection and copy the data from one to the other? Is Cosmos DB perhaps not the right technology to be using when mass updates are required?

azure-cosmos-db

@AdamPliska-1173 We are checking this with our team and will get back to you.


@AdamPliska-1173 Please let us know if you need any further help.

NavtejSaini-MSFT answered

@AdamPliska-1173

Our team has conveyed the following:

The easiest approach would be to use the Cosmos DB Spark connector. You can find:
- The quickstart here: azure-sdk-for-java/quick-start.md at master · Azure/azure-sdk-for-java (github.com)
- An end-to-end sample showing how to read/query data as well as update it: azure-sdk-for-java/01_Batch.ipynb at master · Azure/azure-sdk-for-java (github.com)
- And, if the total dataset would require too large a Spark cluster to handle in one batch, a streaming-mode sample: azure-sdk-for-java/02_StructuredStreaming.ipynb at master · Azure/azure-sdk-for-java (github.com)

RU-wise, updating a document has a higher charge than inserting a new one, so purely from the perspective of minimizing RUs it would be "cheaper" to insert the documents into a new container. But we assume the effort of doing that migration and cutting over with no (or minimal) downtime would outweigh the RU savings. So I would recommend going with the Spark connector, updating the documents in place and, if necessary, restricting the RUs the job may consume (the sample above shows how) so that your normal workloads keep working. The duration of the updates will be a function of the RUs you allow for the job and the size of your Spark cluster (mostly the number of executor cores).
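For orientation only, here is a minimal PySpark configuration sketch of the approach described above, not taken from the linked samples: the account endpoint, key, database, container, the `derivedField` name, and the formula combining `fieldA` and `fieldB` are all placeholders, and the job requires a live Cosmos DB account plus a Spark cluster with the Cosmos Spark 3 connector package installed.

```python
# Hypothetical sketch: bulk in-place update via the Azure Cosmos DB
# Spark 3 connector ("cosmos.oltp" format). All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The connector must be on the classpath, e.g. the Maven package
# com.azure.cosmos.spark:azure-cosmos-spark_3-*_2-12 (version to match
# your Spark runtime).
spark = SparkSession.builder.appName("cosmos-bulk-update").getOrCreate()

cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<key>",
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
    # Cap the share of provisioned RUs the job may consume so normal
    # workloads keep running; depending on the throughput-control mode,
    # additional globalControl settings may also be required.
    "spark.cosmos.throughputControl.enabled": "true",
    "spark.cosmos.throughputControl.name": "bulk-update-job",
    "spark.cosmos.throughputControl.targetThroughputThreshold": "0.5",
}

# Read every document, compute the new field from existing fields,
# then write the documents back (ItemOverwrite replaces whole items;
# ItemPatch is available for partial updates).
df = spark.read.format("cosmos.oltp").options(**cfg).load()
updated = df.withColumn("derivedField", F.col("fieldA") + F.col("fieldB"))
(updated.write.format("cosmos.oltp")
    .options(**cfg)
    .option("spark.cosmos.write.strategy", "ItemOverwrite")
    .mode("append")
    .save())
```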

Regards
Navtej S



NavtejSaini-MSFT answered

@AdamPliska-1173

Our team has also replied:
"Just for a sense of scale, updating 100M records will cost approximately 1B RUs, at least. To complete that in 24 hours, you'd need to be efficiently spending about 12k RU/s. If your documents are larger than 1 KB, the cost goes up correspondingly.

I have a rule of thumb that a serial query (single thread, no async programming, no notable network latency, no app processing time) will consume a little over 300 RU/s, so to reach 12k RU/s you'd need about 40 commands in flight (or more, to compensate for latency, app-side processing, etc.). And because that exceeds 10k RU/s, the work definitely needs to be parallelized across multiple partitions.

There are a couple of things I can think of that would have a ±30% efficiency impact, but at 200k RU/s the job should complete in a little over an hour if the docs are all 1 KB or less. If you're seeing it take a full day at 200k RUs, something is wrong or missing by an order of magnitude. Updating 2,000 docs/sec should cost 20k-26k RU/s (again, unless the documents are very large), not 200k."
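The estimates quoted above can be checked with some back-of-the-envelope arithmetic. The per-update charge of 10 RU is an assumption implied by the "100M records ≈ 1B RUs" figure, not a number stated explicitly in the thread:

```python
# Back-of-the-envelope check of the RU estimates above, assuming
# ~10 RU to update a ~1 KB document (the figure the 1B-RU estimate implies).
RU_PER_UPDATE = 10
DOCS = 100_000_000

total_ru = DOCS * RU_PER_UPDATE           # ~1 billion RUs in total
ru_per_s_for_24h = total_ru / 86_400      # sustained rate to finish in a day

# A serial, synchronous client spends roughly 300 RU/s, so the degree of
# parallelism needed is the target rate divided by that per-thread rate.
parallel_commands = ru_per_s_for_24h / 300

# At 200k RU/s fully utilized, the whole job should take about an hour.
hours_at_200k = total_ru / 200_000 / 3600

print(total_ru)                   # 1_000_000_000
print(round(ru_per_s_for_24h))    # ~11_600, i.e. about 12k RU/s
print(round(parallel_commands))   # ~39, i.e. about 40 commands in flight
print(round(hours_at_200k, 1))    # ~1.4 hours
```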

Regards
Navtej S
