I have a large collection of data stored in Cosmos DB, roughly 100 million documents, though the count fluctuates over time. I would like to change the ‘schema’ of the data to support better reporting. Specifically, I need to add a single field whose value is calculated from other fields already on each record.
First, I wrote a small C# console application to load a record, update it, and save it back to Cosmos. I was issuing the updates in batches, and throughput was poor.
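For context, the loop looked roughly like this (a minimal sketch using the .NET SDK v3; `MyRecord`, `Calculate`, the container names, and the `newField` property are placeholders, not my actual code):

```csharp
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(connectionString);
var container = client.GetContainer("mydb", "records");

// Page through documents that do not yet have the new field.
using var iterator = container.GetItemQueryIterator<MyRecord>(
    "SELECT * FROM c WHERE NOT IS_DEFINED(c.newField)");

while (iterator.HasMoreResults)
{
    foreach (var record in await iterator.ReadNextAsync())
    {
        // Derive the new field from values already on the record.
        record.NewField = Calculate(record);

        // One full round trip and one full-document write per record.
        await container.ReplaceItemAsync(
            record, record.Id, new PartitionKey(record.PartitionKey));
    }
}
```

Each `ReplaceItemAsync` is a separate network round trip plus a full-document write, which is presumably why the RU cost and latency added up so quickly.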
Second, I wrote a Cosmos stored procedure that takes the partition key and the record ID. It loads the record, applies the update, and saves it, all server-side. This improved my throughput, but not enough: with 200k RU/s provisioned, I was still looking at over a week of runtime.
Third, I modified the stored procedure to grab 500 records from the given partition, update them, save them, and return. My C# app calls this procedure repeatedly, passing a partition key covering a range of records that have not yet been updated. I am still provisioned at 200k RU/s, and metrics say I am maxing that out, yet I am processing only about 2,000 records per second. That is four stored-procedure calls per second; a call that updates a full batch of 500 records costs about 7k RU. At this rate it is projected to finish tomorrow at about 4 a.m.
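The batch stored procedure is shaped roughly like this (server-side JavaScript, since Cosmos sprocs are JS; `calculate` and `newField` are placeholders for my actual logic, and real code would also have to handle the `accepted === false` return and resume via a continuation, since sprocs run under a bounded execution time):

```javascript
function updateBatch() {
    var context = getContext();
    var container = context.getCollection();
    var response = context.getResponse();
    var updated = 0;

    // The sproc is invoked with a partition key, so this query is
    // already scoped to a single partition; 500 is the batch size.
    container.queryDocuments(
        container.getSelfLink(),
        'SELECT * FROM c WHERE NOT IS_DEFINED(c.newField)',
        { pageSize: 500 },
        function (err, docs) {
            if (err) throw err;
            if (docs.length === 0) { response.setBody(0); return; }
            docs.forEach(function (doc) {
                // Derive the new field from values already on the document.
                doc.newField = calculate(doc);
                container.replaceDocument(doc._self, doc, function (err2) {
                    if (err2) throw err2;
                    updated++;
                    // Report how many documents this call updated.
                    if (updated === docs.length) response.setBody(updated);
                });
            });
        });
}
```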
My question is: is there a better way to do this sort of transformation? Should I create a new collection and COPY the data from one to the other? Or is Cosmos simply not the right technology to be using when mass updates are required?
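If the copy route is viable, I imagine it would look something like the sketch below, using the SDK's bulk mode so the client batches operations per partition behind the scenes; inserts into a fresh container are write-only, with no read-then-replace per document. (Again, `MyRecord`, `Calculate`, and the container names are placeholders.)

```csharp
using Microsoft.Azure.Cosmos;

// Bulk mode: the SDK groups concurrent operations into batches per partition.
var client = new CosmosClient(connectionString,
    new CosmosClientOptions { AllowBulkExecution = true });

var source = client.GetContainer("mydb", "records");
var target = client.GetContainer("mydb", "records_v2");

using var iterator = source.GetItemQueryIterator<MyRecord>("SELECT * FROM c");
while (iterator.HasMoreResults)
{
    var page = await iterator.ReadNextAsync();
    var tasks = new List<Task>();
    foreach (var record in page)
    {
        // Compute the new field on the way over, so the target
        // container is born with the new 'schema'.
        record.NewField = Calculate(record);
        tasks.Add(target.CreateItemAsync(
            record, new PartitionKey(record.PartitionKey)));
    }
    await Task.WhenAll(tasks); // let bulk mode pack these into batches
}
```

Is this the recommended pattern, or is there something better (change feed, a migration tool, etc.)?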