question

RajD-9527 asked · WintersteinMarine-6222 answered

push data from dataframe to json doc

Hi, I am using the below code in Python to read data from a SQL table into a dataframe, then push the results into a JSON document and save it in Azure Data Lake Storage Gen2.

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview

     jdbcHostname = "hostname"
     jdbcDatabase = "databasename"
     jdbcPort = 1413
     jdbcUsername = "username"
     jdbcPassword = "password"
     
     jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
     connectionProperties = {
       "user" : jdbcUsername,
       "password" : jdbcPassword,
       "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
     }
     
     pushdown_query = "(select * from hr.employee) emp"
     df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
     display(df)
     df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")

The above code displays the data in the dataframe but does not create the folder or the JSON document. Could you please tell me where I am going wrong?

Thank you

azure-databricks azure-data-lake-storage

PRADEEPCHEEKATLA-MSFT answered

Hello @RajD-9527,

I'm able to push data from a dataframe to JSON with the same code you tried.

18597-image.png

Here is the JSON document which is written to Storage account:

18639-image.png

If the above method is not working for you, you can try the below method:

Step 1: Configure the storage account path

 spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net", "gvXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
 output_container_path = "wasbs://sampledata@chepra.blob.core.windows.net/Json_data_folder"

Step 2: Use the below code snippet to write the SQL table result to a JSON document

 df.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(output_container_path)
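As a side note, since the question targets ADLS Gen2: the paths above use the `wasbs://` Blob endpoint. A Gen2 account can also be addressed through the `abfss://` scheme with the `dfs` endpoint. A minimal sketch, reusing the placeholder account/container names from the question (they are not real values):

```python
# Sketch only: building an abfss:// path for an ADLS Gen2 account.
# <storage-account-name> and <file_system> are placeholders from the
# question, not real values.
storage_account = "<storage-account-name>"
container = "<file_system>"

output_container_path = "abfss://{0}@{1}.dfs.core.windows.net/Json_data_folder".format(
    container, storage_account
)

# In the notebook (assumes account-key auth, as in Step 1 above):
# spark.conf.set(
#     "fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account),
#     "<account-key>")
# df.coalesce(1).write.mode("overwrite").json(output_container_path)
```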

18566-image.png

Here is the JSON document which is written to Storage account:

18567-image.png

Hope this helps. Do let us know if you have any further queries.


Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.




Hi PRADEEPCHEEKATLA-MSFT, thanks very much for your response. The Python code works, but only intermittently for some reason. The Scala code snippet throws the error: not found: value df. Also, the table I am trying to save as a JSON document has over 25 million records; when saved to the container it is 42 GB, and when I click on view/edit I see: File size of '42.59GB' exceeds max supported file size of '2.1MB.'

I assume this limit applies only to viewing the raw data in the container and I can still work with the file. Also, is there a best practice for whether a file should be stored in a container or a file share? Please help me understand.

Regards


    org.apache.spark.SparkException: Job aborted.
    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <command-2802701877950826> in <module>
          9 df=spark.createDataFrame([Row(**i) for i in data])
         10 df.show()
    ---> 11 df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")
   
    /databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, mode, compression, dateFormat, timestampFormat, lineSep, encoding)
        815             compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat,
        816             lineSep=lineSep, encoding=encoding)
    --> 817         self._jwrite.json(path)
        818
        819     @since(1.4)


Hello @RajD-9527,

The above is a Python code snippet.

There is a limit on editing files in the Azure portal: you can only edit a file in the portal if its size is under 2.05 MB. If the file is larger, you will receive the error message "`exceeds max supported file size of '2.1MB.'`".
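In other words, the cap only affects the portal's view/edit blade; the blob itself is intact and Spark can read it in full. A minimal sketch using the same placeholder path from the question:

```python
# Sketch: the ~2 MB cap applies to the portal's view/edit blade only;
# the stored file can still be read programmatically.
# <file_system> and <storage-account-name> are placeholders from the question.
path = "wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp"

# In the notebook:
# df = spark.read.json(path)   # reads every part file under the folder
# df.count()                   # confirms all 25M+ rows are readable
```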

It is always best practice to store the data in containers.
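Also, for a ~42 GB result set, `coalesce(1)` forces everything into one huge file; writing several part files of a manageable size is usually friendlier. A rough sketch of picking a partition count (`partitions_for` is a hypothetical helper, and the 256 MB target is an assumption to tune for your workload):

```python
# Sketch: choose an output partition count so each JSON part file lands
# near a target size. 256 MB is an assumed target, not a fixed rule.
TARGET_PART_BYTES = 256 * 1024 * 1024

def partitions_for(total_bytes, target=TARGET_PART_BYTES):
    """Ceiling of total_bytes / target, with at least 1 partition."""
    return max(1, -(-total_bytes // target))

n = partitions_for(int(42.59 * 1024**3))  # the ~42.59 GB dataset

# In the notebook:
# df.repartition(n).write.mode("overwrite").json(output_container_path)
```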

Could you please share the complete stack trace of the error message which you are experiencing?





Hi PRADEEPCHEEKATLA-MSFT, I apologize; I am unable to post comments or reply correctly.

Thank you


Hello @RajD-9527,

Yes, there is a known issue on the Microsoft Q&A site with posting comments below 1,000 characters. This will be fixed soon.

Could you please share the complete stack trace of the error message you are experiencing (try posting it by editing the answer below)?


Hi PRADEEPCHEEKATLA-MSFT, thanks very much for your response. Please find the complete "Job aborted" stack trace posted as an answer below: org.apache.spark.SparkException: Job aborted.

RajD-9527 answered

Hi PRADEEPCHEEKATLA-MSFT, for some reason I run into this error every once in a while: org.apache.spark.SparkException: Job aborted.

     org.apache.spark.SparkException: Job aborted.                
     ---------------------------------------------------------------------------
     Py4JJavaError                             Traceback (most recent call last)
     <command-2802701877950826> in <module>
           9 df=spark.createDataFrame([Row(**i) for i in data])
          10 df.show()
     ---> 11 df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")
     
     /databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, mode, compression, dateFormat, timestampFormat, lineSep, encoding)
         815             compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat,
         816             lineSep=lineSep, encoding=encoding)
     --> 817         self._jwrite.json(path)
         818 
         819     @since(1.4)
     
     /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
        1255         answer = self.gateway_client.send_command(command)
        1256         return_value = get_return_value(
     -> 1257             answer, self.gateway_client, self.target_id, self.name)
        1258 
        1259         for temp_arg in temp_args:




WintersteinMarine-6222 answered

Hi PRADEEPCHEEKATLA-MSFT, this is almost the same solution I encountered in Data Science from Scratch.



      org.apache.spark.SparkException: Job aborted.                
      ---------------------------------------------------------------------------
      Py4JJavaError                             Traceback (most recent call last)
      <command-2802701877950826> in <module>
            9 df=spark.createDataFrame([Row(**i) for i in data])
           10 df.show()
      ---> 11 df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")
         
      /databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, mode, compression, dateFormat, timestampFormat, lineSep, encoding)
          815             compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat,
          816             lineSep=lineSep, encoding=encoding)
      --> 817         self._jwrite.json(path)
          818 
          819     @since(1.4)
         
      /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
         1255         answer = self.gateway_client.send_command(command)
         1256         return_value = get_return_value(
      -> 1257             answer, self.gateway_client, self.target_id, self.name)
         1258 
         1259         for temp_arg in temp_args:
