question

BjornDJensen-2053 asked PRADEEPCHEEKATLA-MSFT edited

GeoSpatial with SparkSQL/Python in Synapse Spark Pool using apache-sedona?

I would like to run spatial queries on large data sets, for which geopandas would be too slow. I found inspiration here: https://anant-sharma.medium.com/apache-sedona-geospark-using-pyspark-e60485318fbe
But I am having trouble registering the spatial functions I would like to use in SparkSQL (or PySpark).

In Spark Pool of Synapse Analytics I prepared (via Azure Portal):
Apache Spark Pool / Settings / Packages / Requirement files / requirement.txt: apache-sedona

Apache Spark Pool / Settings / Packages / Workspace packages:
geotools-wrapper-geotools-24.1.jar
sedona-sql-3.0_2.12-1.2.0-incubating.jar

Apache Spark Pool / Settings / Packages / Spark configuration / config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator

Pyspark Notebook:

 print(spark.version)
 print(spark.conf.get("spark.kryo.registrator"))
 print(spark.conf.get("spark.serializer"))

Print output from notebook:
3.1.2.5.0-58001107
org.apache.sedona.core.serde.SedonaKryoRegistrator
org.apache.spark.serializer.KryoSerializer


Then trying:

 from pyspark.sql import SparkSession
 from sedona.register import SedonaRegistrator  
 from sedona.utils import SedonaKryoRegistrator, KryoSerializer
 spark = SparkSession.builder.master("local[*]").appName("Sedona App").config("spark.serializer", KryoSerializer.getName).config("spark.kryo.registrator", SedonaKryoRegistrator.getName).getOrCreate()
 SedonaRegistrator.registerAll(spark)


But it failed: Py4JJavaError: An error occurred while calling o636.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo

A simple check that everything is installed correctly would probably be this:

 %%sql
 SELECT ST_Point(0,0);
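
The same check can also be scripted from a PySpark cell, which makes the failure reason easier to capture. This is only a minimal sketch: it assumes `spark` is the session the Synapse notebook provides, and the helper name `sedona_sql_available` is made up for illustration (it is not part of Sedona or Synapse).

```python
def sedona_sql_available(spark):
    """Return (ok, detail) after trying one Sedona SQL function.

    Hypothetical helper for illustration; `spark` is assumed to be the
    SparkSession that the notebook already provides.
    """
    try:
        # ST_AsText turns the geometry into a plain string, so the result
        # can be collected without any geometry deserialization in Python.
        rows = spark.sql("SELECT ST_AsText(ST_Point(0.0, 0.0)) AS wkt").collect()
        return True, str(rows[0][0])
    except Exception as exc:  # e.g. an analysis error if ST_Point is unknown
        return False, str(exc)


# In a notebook cell (not executed here):
# ok, detail = sedona_sql_available(spark)
# print("Sedona SQL registered:", ok, detail)
```

If the function is not registered, `detail` carries the error text, which is usually the useful part to share when asking for help.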


Please help with getting the spatial functions registered in pyspark running in Synapse notebook!










azure-synapse-analytics

PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT edited

Hello @BjornDJensen-2053,

Thanks for the question and for using the MS Q&A platform.

I tried to reproduce this on my end and was able to run the above commands without any issue.

I only installed the requirements.txt file and downloaded the following two jar files:

  • sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar

  • geotools-wrapper-geotools-24.0.jar

Note: the config.txt file is not required.


If you are still seeing the same error message, please share its complete stack trace.

Hope this helps. Please let us know if you have any further queries.



BjornDJensen-2053 answered PRADEEPCHEEKATLA-MSFT commented

Turns out I used the wrong jar... :-0
I can continue now. Thanks for helping!
But let me know if you have a hint about how to get the full list of spatial functions available in the current Spark session.


Here is a refined version that seems to work (-:

Uploading workspace packages (2 jars) in Synapse Studio / Manage / Configuration + libraries / Workspace packages:
geotools-wrapper-geotools-24.1.jar (downloaded from https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper/geotools-24.1 )
sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar (downloaded from https://search.maven.org/artifact/org.apache.sedona/sedona-python-adapter-3.0_2.12/1.0.0-incubating/jar )

Then in
Apache Spark Pool / Settings / Packages / Workspace packages : selecting the above workspace packages

Uploading txt file:
Apache Spark Pool / Settings / Packages / Requirement files / requirements.txt : apache-sedona

Further uploading config.txt:
Apache Spark Pool / Settings / Packages / Spark configuration / config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator



The above configuration allows me to write fewer lines:

 from sedona.register import SedonaRegistrator  
 from sedona.utils import SedonaKryoRegistrator, KryoSerializer
 SedonaRegistrator.registerAll(spark)
 print(spark.version)
 print(spark.conf.get("spark.kryo.registrator"))
 print(spark.conf.get("spark.serializer"))


And now the fun starts:

 %%sql
 SELECT st_point(0.0,0.0);
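
On the open question above about listing the available spatial functions: one possible approach is to filter the session's function catalog by name. This is a sketch under assumptions, not a confirmed Sedona recipe. It assumes that `SedonaRegistrator.registerAll(spark)` adds its functions to the registry that `spark.catalog.listFunctions()` reads, and that the spatial functions share the `ST_` prefix. Both helper names are hypothetical.

```python
def spatial_function_names(function_names, prefix="st_"):
    """Return sorted, de-duplicated, lower-cased names starting with `prefix`.

    The comparison is case-insensitive, since SQL function names are.
    """
    wanted = prefix.lower()
    return sorted({n.lower() for n in function_names if n.lower().startswith(wanted)})


def list_spatial_functions(spark, prefix="st_"):
    """List function names registered in this session that match `prefix`.

    Assumption: Sedona's functions show up in spark.catalog.listFunctions(),
    whose entries expose a `.name` attribute.
    """
    names = [f.name for f in spark.catalog.listFunctions()]
    return spatial_function_names(names, prefix)


# In a notebook cell, after SedonaRegistrator.registerAll(spark) (not executed here):
# for name in list_spatial_functions(spark):
#     print(name)
```

Keeping the name filter separate from the catalog call makes it easy to try a different prefix if a given Sedona version registers functions under other names.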









Hello @BjornDJensen-2053,

Glad to know that your issue has been resolved. And thanks for sharing the solution, which might be beneficial to other community members reading this thread.
