question

SurendiranBalasubramanian-5094 avatar image
0 Votes"
SurendiranBalasubramanian-5094 asked HimanshuSinha-MSFT commented

I need to write a data into json file like in the below format using pyspark.

{
"list-item": [
{"author":"author1","title":"title1","pages":1,"email":"author1@gmail.com"},
{"author":"author2","title":"title2","pages":2,"email":"author2@gmail.com"},
{"author":"author3","title":"title3","pages":3,"email":"author3@gmail.com"},
{"author":"author4","title":"title4","pages":4,"email":"author4@gmail.com"},
],
"version": 1
}

I have written the below pyspark code but it write "" and adding "" at the beginning and end of each item. How to remove the backslash and double quote

import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col,to_json,struct,collect_list,lit
from datetime import datetime
from time import time
if name == 'main':
spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()


 schema = StructType([ 
     StructField("author", StringType(), False), 
     StructField("title", StringType(), False), 
     StructField("pages", IntegerType(), False), 
     StructField("email", StringType(), False) 
 ]) 
 data = [ 
     ["author1", "title1", 1, "author1@gmail.com"], 
     ["author2", "title2", 2, "author2@gmail.com"], 
     ["author3", "title3", 3, "author3@gmail.com"], 
     ["author4", "title4", 4, "author4@gmail.com"] 
 ] 
 
 df = spark.createDataFrame(data, schema) 
 df=df.select(to_json(struct("author", "title", "pages", "email")).alias("json-data")).agg(collect_list("json-data").alias("list-item")) 
 df=df.withColumn("version",lit("1.0").cast(IntegerType())) 
 df.printSchema() 
 df.show(2, False) 
 curDT = datetime.now() 
 
 targetPath = curDT.strftime("%m-%d-%Y-%H-%M-%S") 
 df.write.format("json").mode("overwrite").option("escape", "").save(targetPath) 


my code writes the json with backslash and double quote enclosed each item like below.how to remove those.Please help

{"list-item":["{\"author\":\"author1\",\"title\":\"title1\",\"pages\":1,\"email\":\"author1@gmail.com\"}","{\"author\":\"author2\",\"title\":\"title2\",\"pages\":2,\"email\":\"author2@gmail.com\"}","{\"author\":\"author3\",\"title\":\"title3\",\"pages\\":3,\"email\":\"author3@gmail.com\\"}","{\"author\":\"author4\",\"title\":\"title4\",\"pages\":4,\"email\":\"author4@gmail.com\"}"],"version":1}

azure-databricksdotnet-ml-big-data
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hello @SurendiranBalasubramanian-5094,
We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet .In case if you have any resolution please do share that same with the community as it can be helpful to others . Otherwise, will respond back with the more details and we will try to help .
Thanks
Himanshu

0 Votes 0 ·

Hello @SurendiranBalasubramanian-5094,
We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet .In case if you have any resolution please do share that same with the community as it can be helpful to others .
If you have any question relating to the current thread, please do let us know and we will try out best to help you.
In case if you have any other question on a different issue, we request you to open a new thread .
Thanks
Himanshu

0 Votes 0 ·

1 Answer

HimanshuSinha-MSFT avatar image
0 Votes"
HimanshuSinha-MSFT answered HimanshuSinha-MSFT edited

Hello @SurendiranBalasubramanian-5094,
Thanks for the question and using MS Q&A platform.

As we understand the ask here is to remove the "\" from the JSON , please do let us know if its not accurate.
You need to remove the to_json function and it should work fine .
Existing code

df.select(to_json(struct("author", "title", "pages", "email")).alias("json-data")).agg(collect_list("json-data").alias("list-item"))

Update this to

df=df.select(struct("author", "title", "pages", "email").alias("json-data")).agg(collect_list("json-data").alias("list-item"))

I have tested the output and looks like this .

{"list-item":[{"author":"author1","title":"title1","pages":1,"email":"author1@gmail.com"},{"author":"author2","title":"title2","pages":2,"email":"author2@gmail.com"},{"author":"author3","title":"title3","pages":3,"email":"author3@gmail.com"},{"author":"author4","title":"title4","pages":4,"email":"author4@gmail.com"}],"version":1}




Please do let me if you have any queries.
Thanks
Himanshu


  • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how

  • Want a reminder to come back and check responses? Here is how to subscribe to a notification

  • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators



5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.