I need to write a data into json file like in the below format using pyspark.

Surendiran Balasubramanian 1 Reputation point
2022-05-30T19:32:06.1+00:00

{
"list-item": [
{"author":"author1","title":"title1","pages":1,"email":"author1@Stuff .com"},
{"author":"author2","title":"title2","pages":2,"email":"author2@Stuff .com"},
{"author":"author3","title":"title3","pages":3,"email":"author3@Stuff .com"},
{"author":"author4","title":"title4","pages":4,"email":"author4@Stuff .com"},
],
"version": 1
}

I have written the below pyspark code but it write "" and adding "" at the beginning and end of each item. How to remove the backslash and double quote

import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col,to_json,struct,collect_list,lit
from datetime import datetime
from time import time
if name == 'main':
spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()

schema = StructType([   
    StructField("author", StringType(), False),   
    StructField("title", StringType(), False),   
    StructField("pages", IntegerType(), False),   
    StructField("email", StringType(), False)   
])   
data = [   
    ["author1", "title1", 1, "author1@gmail.com"],   
    ["author2", "title2", 2, "author2@gmail.com"],   
    ["author3", "title3", 3, "author3@gmail.com"],   
    ["author4", "title4", 4, "author4@gmail.com"]   
]   

df = spark.createDataFrame(data, schema)   
df=df.select(to_json(struct("author", "title", "pages", "email")).alias("json-data")).agg(collect_list("json-data").alias("list-item"))   
df=df.withColumn("version",lit("1.0").cast(IntegerType()))   
df.printSchema()   
df.show(2, False)   
curDT = datetime.now()   

targetPath = curDT.strftime("%m-%d-%Y-%H-%M-%S")   
df.write.format("json").mode("overwrite").option("escape", "").save(targetPath)   

my code writes the json with backslash and double quote enclosed each item like below.how to remove those.Please help

{"list-item":["{\"author\":\"author1\",\"title\":\"title1\",\"pages\":1,\"email\":\"author1@Stuff .com\"}","{\"author\":\"author2\",\"title\":\"title2\",\"pages\":2,\"email\":\"author2@Stuff .com\"}","{\"author\":\"author3\",\"title\":\"title3\",\"pages\":3,\"email\":\"author3@Stuff .com\"}","{\"author\":\"author4\",\"title\":\"title4\",\"pages\":4,\"email\":\"author4@Stuff .com\"}"],"version":1}

.NET
.NET
Microsoft Technologies based on the .NET software framework.
3,264 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
1,890 questions
{count} votes

1 answer

Sort by: Most helpful
  1. HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
    2022-05-31T21:25:14.03+00:00

    Hello @Surendiran Balasubramanian ,
    Thanks for the question and using MS Q&A platform.

    As we understand the ask here is to remove the "\" from the JSON , please do let us know if its not accurate.
    You need to remove the to_json function and it should work fine .
    Existing code

    df.select(to_json(struct("author", "title", "pages", "email")).alias("json-data")).agg(collect_list("json-data").alias("list-item"))

    Update this to

    df=df.select(struct("author", "title", "pages", "email").alias("json-data")).agg(collect_list("json-data").alias("list-item"))

    I have tested the output and looks like this .

    {"list-item":[{"author":"author1","title":"title1","pages":1,"email":"author1@Stuff .com"},{"author":"author2","title":"title2","pages":2,"email":"author2@Stuff .com"},{"author":"author3","title":"title3","pages":3,"email":"author3@Stuff .com"},{"author":"author4","title":"title4","pages":4,"email":"author4@Stuff .com"}],"version":1}

    Please do let me if you have any queries.
    Thanks
    Himanshu


    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
      • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators