Hi All,
My requirement was to process approximately 1 TB of data stored in an Azure storage container. The container holds millions of JSON files, which are multi-part in nature.
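For context, this is roughly how the data is read in Spark (a simplified sketch; the storage account, container name, and path below are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ABFS path; the real job points at the actual storage account and container
input_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/json-data/"

# Read the millions of JSON part files under the container path into a DataFrame
df = spark.read.json(input_path)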
For this I was using HDInsight, which was able to process the data in approximately 45 minutes with the following configuration:
Worker nodes (1-4, autoscale): 16 cores, 112 GB
Head nodes (2): 4 cores, 28 GB
We planned to migrate to an Azure Databricks Spark cluster. The configuration of the cluster used is below (a rough sketch of the equivalent cluster spec follows the list):
Worker nodes (4-10, autoscale): 8 cores, 56 GB, memory optimized
Head (driver) node: 4 cores, 28 GB
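For reference, here is that cluster definition written out as the payload I believe the Databricks Clusters API would use (the cluster was actually created through the UI, so the exact node SKUs and runtime version below are my best guess):

# Approximate cluster spec as a Python dict; Standard_DS13_v2 (8 cores, 56 GB) and
# Standard_DS12_v2 (4 cores, 28 GB) are the SKUs that match the sizes listed above,
# and the runtime version is indicative only.
cluster_spec = {
    "cluster_name": "json-processing-cluster",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS13_v2",         # workers: 8 cores, 56 GB, memory optimized
    "driver_node_type_id": "Standard_DS12_v2",  # driver: 4 cores, 28 GB
    "autoscale": {"min_workers": 4, "max_workers": 10},
}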
However, this keeps running for more than 2.5 hours and the process still does not complete. I can see that it uses at most 4 worker nodes and never scales up to leverage the remaining workers to speed up the process.
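In case it helps with the diagnosis: my understanding is that autoscaling only adds workers when there are enough pending tasks, so I am checking the parallelism of the read along these lines (df is the DataFrame from the sketch above):

# Number of input partitions = maximum number of tasks that can run in parallel
# for the read stage; if this is low, autoscaling has no reason to add workers
print(df.rdd.getNumPartitions())

# Setting that controls how input files are packed into partitions (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))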
Can anyone help me figure out whether I am doing something wrong here?