Synapse Spark pool - PySpark: load a subset of XML files from a given folder

Dheeraj 351 Reputation points
2021-07-18T07:13:52.333+00:00

I am limited to a 4 vCores / 32 GB, 3 to 10 node configuration of an Apache Spark pool.
I am trying to load all XML files from a given folder with the code below:
spark.read.format("com.databricks.spark.xml").option("rowTag","Quality").load("/mnt/dev/tmp/xml/100_file/M*.xml")

But there are more than a thousand files in the folder, and my small Synapse Spark pool with 32 GB of RAM cannot handle that many files efficiently. So what I want is to read only the first 100 files in the first round, then the next 100 files, and so on.

Is there an API function that allows me to do this?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Saurabh Sharma 23,671 Reputation points Microsoft Employee
    2021-07-20T17:58:49.377+00:00

    Hi @Dheeraj ,

    You can use the limit(100) function to get 100 files and achieve the same.
    Please let me know if you have any other questions.

    Thanks
    Saurabh
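
    For reference, a minimal sketch of what a batched read might look like. Note that limit(100) caps the number of rows in the resulting DataFrame rather than the number of files read, so if the goal is to read 100 files at a time, one option (my own suggestion, not part of the original answer) is to list the folder with mssparkutils and pass explicit batches of paths to load(). The folder path is taken from the question; depending on how the mount was created, mssparkutils may need the full synfs/abfss URI instead.

    # Hypothetical sketch for a Synapse notebook, where `spark` and
    # mssparkutils are predefined; the folder path comes from the question.
    from notebookutils import mssparkutils

    folder = "/mnt/dev/tmp/xml/100_file"
    batch_size = 100

    # Keep only the M*.xml files, matching the original glob pattern.
    files = [f.path for f in mssparkutils.fs.ls(folder)
             if f.name.startswith("M") and f.name.endswith(".xml")]

    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        df = (spark.read.format("com.databricks.spark.xml")
              .option("rowTag", "Quality")
              .load(batch))  # DataFrameReader.load() accepts a list of paths
        # Process or persist each batch before moving to the next, e.g.:
        # df.write.mode("append").parquet("/mnt/dev/tmp/xml_parquet")

    Each pass then stays within what the 32 GB pool can handle, at the cost of running several smaller jobs instead of one large one.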


1 additional answer

  1. Ljubo Jurkovic 41 Reputation points
    2022-07-26T20:59:43.153+00:00

    Hi,
    I realize this is a bit late in this thread, but I'm struggling to get the line working where you load the XML files:
    spark.read.format("com.databricks.spark.xml").option("rowTag","Quality").load("/mnt/dev/tmp/xml/100_file/M*.xml")
    How did you install the package that supports this format in the Synapse Analytics Spark pool? I'm getting this error:
    "java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml"
    Regards,
    LJ
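
    In case others hit the same error: spark-xml does not ship with Synapse Spark pools, so the package has to be made available to the pool or session first. One possible approach (the Maven coordinate and version below are assumptions on my part; pick a spark-xml build matching your pool's Spark/Scala version) is a session-level %%configure cell at the top of the notebook:

    %%configure -f
    {
        "conf": {
            "spark.jars.packages": "com.databricks:spark-xml_2.12:0.14.0"
        }
    }

    Alternatively, the jar can be uploaded as a workspace package and assigned to the Spark pool in Synapse Studio.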
