Azure Databricks Spark not performing

Debasish22 1 Reputation point
2022-04-14T15:47:52.677+00:00

Hi All,

My requirement is to process approximately 1 TB of data stored in an Azure container. The container holds millions of JSON files, which are multi-part in nature.

For this I was using HDInsight, which was able to process the data in approximately 45 minutes with this configuration:

Worker nodes (1-4, autoscale): 16 cores, 112 GB
Head nodes (2): 4 cores, 28 GB

We planned to migrate to an Azure Databricks Spark cluster.

Configuration of the cluster used:

Worker nodes (4-10, autoscale): 8 cores, 56 GB, memory optimized
Head nodes: 4 cores, 28 GB

However, the job ran for more than 2.5 hours and still had not completed. I can see that it used the 4 worker nodes to their maximum, but it did not scale up to use the remaining worker nodes to speed up the process.
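
For comparison, here is a rough sketch of the total worker capacity of the two setups. It assumes the cores and memory quoted above are per worker node, and that HDInsight was running at its autoscale maximum of 4 workers; the helper name `capacity` is only for illustration.

```python
# Rough capacity comparison of the two clusters described above.
# Assumption: the quoted cores/memory figures are per worker node.

def capacity(nodes, cores_per_node, mem_gb_per_node):
    """Return (total cores, total memory in GB) for a number of workers."""
    return nodes * cores_per_node, nodes * mem_gb_per_node

hdinsight_max = capacity(4, 16, 112)    # HDInsight at its autoscale maximum
databricks_min = capacity(4, 8, 56)     # Databricks at its autoscale minimum (what I observed)
databricks_max = capacity(10, 8, 56)    # Databricks at its autoscale maximum

print("HDInsight, 4 workers:  ", hdinsight_max)
print("Databricks, 4 workers: ", databricks_min)
print("Databricks, 10 workers:", databricks_max)
```

If those assumptions hold, the 4 Databricks workers that were actually used provide half the cores and memory of 4 HDInsight workers, and only the fully scaled-up Databricks cluster would exceed the HDInsight capacity.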

Can anyone help me understand whether I am doing something wrong here?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.