Hi all,
At work we use Databricks (with Spark Structured Streaming) + Azure Blob Storage + Delta tables to stream changes from our on-premise infrastructure, maintaining a near-real-time (just seconds/minutes behind) replica of our on-premise SQL Server databases.
These on-premise databases receive frequent but small upserts (tens of records per minute), so each Spark micro-batch that we upsert into the cloud Delta lake generally contains no more than 100 records, and most of the time only 1 to 10 records.
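For context, our per-batch upsert follows the standard foreachBatch + Delta MERGE pattern; a simplified sketch is below (the table name `replica.customers` and the join key `id` are placeholders, not our real schema):

```python
from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # MERGE the micro-batch into the target Delta table on the primary key
    (DeltaTable.forName(micro_batch_df.sparkSession, "replica.customers")
        .alias("t")
        .merge(micro_batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# Attached to the stream roughly like:
# stream_df.writeStream.foreachBatch(upsert_to_delta).start()
```

The 5s floor we describe below is the wall-clock time of one invocation of a function like `upsert_to_delta`.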
What we have found is that each upsert into the Delta lake takes at least 5s, regardless of size: a micro-batch with 1 row takes 5s, 10 rows takes 5s, and 10k rows also takes 5s. Only above 10k rows do we start seeing higher times (for instance, a 100k-record micro-batch takes 15s to process and upsert into the Delta lake).
My first question is whether a minimum time of 5s per operation against a Delta table is normal to expect (our Delta tables are optimized according to Delta best practices). We also ran a test where we deleted everything and started from scratch, and we still saw 5s per micro-batch when upserting a single record into a Delta table that held only 10 records (one Parquet file, a few KB in size).
If the answer is no, and we should expect lower times (I'd expect <1s), what do you suggest we start looking at? We already tried moving from Standard to Premium storage accounts, and the result was the same.