Synapse Spark pool - PySpark: load a subset of XML files from a given folder

Dheeraj 351 Reputation points
2021-07-18T07:13:52.333+00:00

I am limited to a 4 vCores / 32 GB, 3 to 10 node configuration of an Apache Spark pool.
I am trying to load all XML files from a given folder with the code below:
spark.read.format("com.databricks.spark.xml").option("rowTag","Quality").load("/mnt/dev/tmp/xml/100_file/M*.xml")

But there are more than a thousand files in the folder, and my small Synapse Spark pool with 32 GB of RAM cannot handle that many files efficiently. So what I want is to read only the first 100 files in the first round, then the next 100 files, and so on.

Is there an API function that allows me to do this?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Saurabh Sharma 23,671 Reputation points Microsoft Employee
    2021-07-20T17:58:49.377+00:00

    Hi @Dheeraj ,

    You can use the limit(100) function to get 100 files and achieve the same.
    Please let me know if you have any other questions.

    Thanks
    Saurabh
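
    For reference, a minimal sketch of what a batched read might look like. Note that limit(100) caps the number of rows in the resulting DataFrame rather than the number of files read, so if the goal is to read 100 files at a time, one option (my own suggestion, not part of the original answer) is to list the folder with mssparkutils and pass explicit batches of paths to load(). The folder path is taken from the question; depending on how the mount was created, mssparkutils may need the full synfs/abfss URI instead.

    # Hypothetical sketch for a Synapse notebook, where `spark` and
    # mssparkutils are predefined; the folder path comes from the question.
    from notebookutils import mssparkutils

    folder = "/mnt/dev/tmp/xml/100_file"
    batch_size = 100

    # Keep only the M*.xml files, matching the original glob pattern.
    files = [f.path for f in mssparkutils.fs.ls(folder)
             if f.name.startswith("M") and f.name.endswith(".xml")]

    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        df = (spark.read.format("com.databricks.spark.xml")
              .option("rowTag", "Quality")
              .load(batch))  # DataFrameReader.load() accepts a list of paths
        # Process or persist each batch before moving to the next, e.g.:
        # df.write.mode("append").parquet("/mnt/dev/tmp/xml_parquet")

    Each pass then stays within what the 32 GB pool can handle, at the cost of running several smaller jobs instead of one large one.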


1 additional answer

  1. Ljubo Jurkovic 41 Reputation points
    2022-07-26T20:59:43.153+00:00

    Hi,
    I realize this is a bit late in this thread, but I'm struggling to get the line working where you load the XML files:
    spark.read.format("com.databricks.spark.xml").option("rowTag","Quality").load("/mnt/dev/tmp/xml/100_file/M*.xml")
    How did you install the package that supports this format in the Synapse Analytics Spark pool? I'm getting this error:
    "java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml"
    Regards,
    LJ
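
    In case others hit the same error: spark-xml does not ship with Synapse Spark pools, so the package has to be made available to the pool or session first. One possible approach (the Maven coordinate and version below are assumptions on my part; pick a spark-xml build matching your pool's Spark/Scala version) is a session-level %%configure cell at the top of the notebook:

    %%configure -f
    {
        "conf": {
            "spark.jars.packages": "com.databricks:spark-xml_2.12:0.14.0"
        }
    }

    Alternatively, the jar can be uploaded as a workspace package and assigned to the Spark pool in Synapse Studio.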
